Data structure for non overlapping ranges of integers? - data-structures

I remember learning a data structure that stored a set of integers as ranges in a tree, but it's been 10 years and I can't remember the name of the data structure, and I'm a bit fuzzy on the details. If it helps, it's a functional data structure that was taught at CMU, I believe in 15-212 (Principles of Programming) in 2002.
Basically, I want to store a set of integers, most of which are consecutive. I want to be able to query for set membership efficiently, add a range of integers efficiently, and remove a range of integers efficiently. In particular, I don't care to preserve what the original ranges are. It's better if adjacent ranges are coalesced into a single larger range.
A naive implementation would be to simply use a generic set data structure such as a HashSet or TreeSet, and add all integers in a range when adding a range, or remove all integers in a range when removing a range. But of course, that would waste a lot of memory in addition to making add and remove slow.
I'm thinking of a purely functional data structure, but for my current use I don't need it to be. IIRC, lookup, insertion, and deletion were all O(log N), where N was the number of ranges in the set.
So, can you tell me the name of the data structure I'm trying to remember, or a suitable alternative?

I found the old homework and the data structure I had in mind were Discrete Interval Encoding Trees or diets for short. They are described in detail in Diets for Fat Sets, Martin Erwig. Journal of Functional Programming, Vol. 8, No. 6, 627-632, 1998. It is basically a tree of intervals with the invariant that all of the intervals are non-overlapping and non-touching. There is a Haskell implementation in Hackage. I was hoping there would be an existing implementation for Scala, but I'm not seeing any.
The homework also included another data structure they called a Recursive Interval-Occluding Tree (RIOT), which rather than keeping only an interval at each node keeps an interval and another (possibly empty) RIOT of things removed from the interval. The assignment included benchmarks showing it did better than diets for random insertions and deletions. AFAICT it is simply something the TAs made up and never published as it no longer seems to exist anywhere on the Internets, at least not under that name.

You probably are looking for segment trees. This might be helpful: http://www.topcoder.com/tc?d1=tutorials&d2=lowestCommonAncestor&module=Static
You can also use binary search trees for the same, for which each node will have two data fields: min_val and max_val.
During insertion algorithm, you just need to call another merging operation to check if the left-child,parent,right-child create a sequence, so as to club them into a single node. This will take O(log n) time.
Other operations like deletion and look-up will take O(log n) time as usual, but special measures need to be taken while deletion.

Related

What is the most optimal data structure for storing a partially-ordered set (poset)?

I am looking for a data structure for storing a poset, that supports the following operations with good big-Oh complexity (these are done frequently):
determining if a given element can be consistently ordered before another one (top priority above others)
adding new ordering constraints
returning all elements that can be ordered before a given element
The data structure must also support generating a totally ordered set from the poset, but that only needs to be done once for the output after the algorithm has already run.
So far I have been relying on a naive poset implementation, but I am reaching a point where that is no longer sustainable.
I have looked at https://cstheory.stackexchange.com/questions/8503/what-persistent-data-structure-for-a-set-of-partially-ordered-elements, and there are some interesting papers in the answers, but most of these data structures (like ChainMerge) seem to be optimized with sorting in mind, where for me it is the least important step.

Data structure for range query

I was recently asked a coding question on the below problem.
I have some solution to this problem but I am not very sure if those are most efficient.
Problem:
Write a program to track set of text ranges. Start point and end point will be string.
Text range example : [AbA-Ef]
Aa would fall before this range
AB would fall inside this range
etc.
String comparison would be like 'A' < 'a' < 'B' < 'b' ... 'Z' < 'z'
We need to support following operations on this range
Add range - this should merge the ranges if applicable
Delete range - this deletes range from tracked ranges and recompute the ranges
Query range - Given a character, function should return whether it is part of any of tracked ranges or not.
Note that tracked ranges can be dis-continuous.
My solutions:
I came up with two approaches.
Store ranges as doubly linked list or
Store ranges as some sort of balanced tree with leaf node having actual data and they are inter-connected as linked list.
Do you think that this solution are good enough or you can think of any better way of doing this so that those three API gives your best performance ?
You are probably looking for an interval tree.
Use the data structure with your custom comparator to indicate "What's on range", and you will be able to do the required operations efficiently.
Note, an interval tree is actually an efficient way to implement your 2nd idea (Store ranges as a some sort of balanced tree)
I'm not clear on what the "delete range" operation is supposed to do. Does it;
Delete a previously inserted range, and recompute the merge of the remaining ranges?
Stop tracking the deleted range, regardless of how many times parts of it have been added.
That doesn't make a huge difference algorithmically; it's just bookkeeping. But it's important to clarify. Also, are the ranges closed or half-open? (Another detail which doesn't affect the algorithm but does affect the implementation).
The basic approach to this problem is to merge the tracked set into a sorted list of disjoint (non-overlapping) ranges; either as a vector or a binary search tree, or basically any structure which supports O(log n) searching.
One approach is to put both endpoints of every disjoint range into the datastructure. To find out if a target value is in a range, find the index of the smallest endpoint greater than the target. If the index is odd the target is in some range; even means it's outside.
Alternatively, index all the disjoint ranges by their start points; find the target by searching for the largest start-point not greater than the target, and then compare the target with the associated end-point.
I usually use the first approach with sorted vectors, which are plausible if (a) space utilization is important and (b) insert and merge are relatively rare. With binary search trees, I go for the second approach. But they differ only in details and constants.
Merging and deleting are not difficult, but there are an annoying number of cases. You start by finding the ranges corresponding to the endpoints of the range to be inserted/deleted (using the standard find operation), remove all the ranges in between the two, and fiddle with the endpoints to correct the partially overlapping ranges. While the find operation is always O(log n), the tree/vector manipulation is o(n) (if the inserted/deleted range is large, anyway).
Most languages, including Java and C++, have a some sort of ordered map or ordered set in which you can both look up a value and find the next value after or the first value before a value. You could use this as a building block - If it contains a set of disjoint ranges then it will have a least element of a range followed by a greatest element of a range followed by the least element of a range followed by the greatest element of a range and so on. When you add a range you can check to see if you have preserved this property. If not, you need to merge ranges. Similarly, you want to preserve this when you delete. Then you can query by just looking to see if there is a least element just before your query point and a greatest element just after.
If you want to create your own datastructure from scratch, I would think about some sort of radix trie structure, because this avoids doing lots of repeated string comparisons.
I think you would go for B+ tree it's the same which you have mentioned as your second approach.
Here are some properties of B+ tree:
All data is stored leaf nodes.
Every leaf is at the same level.
All leaf nodes have links to other leaf nodes.
Here are few applications B+ tree:
It reduces the number of I/O operations required to find an element in the tree.
Often used in the implementation of database indexes.
The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, file systems.
NTFS uses B+ trees for directory indexing.
Basically it helps for range queries look ups, minimizes tree traversing.

Hash table vs Balanced binary tree [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
What factors should I take into account when I need to choose between a hash table or a balanced binary tree in order to implement a set or an associative array?
This question cannot be answered, in general, I fear.
The issue is that there are many types of hash tables and balanced binary trees, and their performances vary widely.
So, the naive answer is: it depends on the functionality you need. Use a hash table if you do not need ordering and a balanced binary tree otherwise.
For a more elaborate answer, let's consider some alternatives.
Hash Table (see Wikipedia's entry for some basics)
Not all hash tables use a linked-list as a bucket. A popular alternative is to use a "better" bucket, for example a binary tree, or another hash table (with another hash function), ...
Some hash tables do not use buckets at all: see Open Addressing (they come with other issues, obviously)
There is something called Linear re-hashing (it's a quality of implementation detail), which avoids the "stop-the-world-and-rehash" pitfall. Basically during the migration phase you only insert in the "new" table, and also move one "old" entry into the "new" table. Of course, migration phase means double look-up etc...
Binary Tree
Re-balancing is costly, you may consider a Skip-List (also better for multi-threaded accesses) or a Splay Tree.
A good allocator can "pack" nodes together in memory (better caching behavior), even though this does not alleviate the pointer-look-up issue.
B-Tree and variants also offer "packing"
Let's not forget that O(1) is an asymptotic complexity. For few elements, the coefficient is usually more important (performance-wise). Which is especially true if your hash function is slow...
Finally, for sets, you may also wish to consider probabilistic data structures, like Bloom Filters.
Hash tables are generally better if there isn't any need to keep the data in any sort of sequence. Binary trees are better if the data must be kept sorted.
A worthy point on a modern architecture: A Hash table will usually, if its load factor is low, have fewer memory reads than a binary tree will. Since memory access tend to be rather costly compared to burning CPU cycles, the Hash table is often faster.
In the following Binary tree is assumed to be self-balancing, like a red black tree, an AVL tree or like a treap.
On the other hand, if you need to rehash everything in the hash table when you decide to extend it, this may be a costly operation which occur (amortized). Binary trees does not have this limitation.
Binary trees are easier to implement in purely functional languages.
Binary trees have a natural sort order and a natural way to walk the tree for all elements.
When the load factor in the hash table is low, you may be wasting a lot of memory space, but with two pointers, binary trees tend to take up more space.
Hash tables are nearly O(1) (depending on how you handle the load factor) vs. Bin trees O(lg n).
Trees tend to be the "average performer". There are nothing they do particularly well, but then nothing they do particularly bad.
Hash tables are faster lookups:
You need a key that generates an even distribution (otherwise you'll miss a lot and have to rely on something other than hash; like a linear search).
Hash's can use a lot of empty space. You may reserve 256 entries but only need 8 (so far).
Binary trees:
Deterministic. O(log n) I think...
Don't need extra space like hash tables can
Must be kept sorted. Adding an element in the middle means moving the rest around.
A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function.
If a total order relationship is available, then a sorted array has lookup performance comparable to binary trees, worst-case insert performance in the order of hash tables, and less complexity and memory use than both.
The worst-case insertion complexity for a hash table can be left at O(1)/O(log K) (with K the number of elements with the same hash) if it's acceptable to increase the worst-case lookup complexity to O(K) or O(log K) if the elements can be sorted.
Invariants for both trees and hash tables are expensive to restore if the keys change, but less than O(n log N) for sorted arrays.
These are factors to take into account in deciding which implementation to use:
Availability of a total order relationship.
Availability of a good hashing function for the equivalence relationship.
A-priory knowledge of the number of elements.
Knowledge about the rate of insertions, deletions, and lookups.
Relative complexity of the comparison and hashing functions.
If you only need to access single elements, hashtables are better. If you need a range of elements, you simply have no other option than binary trees.
To add to the other great answers above, I'd say:
Use a hash table if the amount of data will not change (e.g. storing constants); but, if the amount of data will change, use a tree. This is due to the fact that, in a hash table, once the load factor has been reached, the hash table must resize. The resize operation can be very slow.
One point that I don't think has been addressed is that trees are much better for persistent data structures. That is, immutable structures. A standard hash table (i.e. one that uses a single array of linked lists) cannot be modified without modifying the whole table. One situation in which this is relevant is if two concurrent functions both have a copy of a hash table, and one of them changes the table (if the table is mutable, that change will be visible to the other one as well). Another situation would be something like the following:
def bar(table):
# some intern stuck this line of code in
table["hello"] = "world"
return table["the answer"]
def foo(x, y, table):
z = bar(table)
if "hello" in table:
raise Exception("failed catastrophically!")
return x + y + z
important_result = foo(1, 2, {
"the answer": 5,
"this table": "doesn't contain hello",
"so it should": "be ok"
})
# catastrophic failure occurs
With a mutable table, we can't guarantee that the table a function call receives will remain that table throughout its execution, because other function calls might modify it.
So, mutability is sometimes not a pleasant thing. Now, a way around this would be to keep the table immutable, and have updates return a new table without modifying the old one. But with a hash table this would often be a costly O(n) operation, since the entire underlying array would need to be copied. On the other hand, with a balanced tree, a new tree can be generated with only O(log n) nodes needing to be created (the rest of the tree being identical).
This means that an efficient tree can be very convenient when immutable maps are desired.
If you''ll have many slightly-different instances of sets, you'll probably want them to share structure. This is easy with trees (if they're immutable or copy-on-write). I'm not sure how well you can do it with hashtables; it's at least less obvious.
In my experience, hastables are always faster because trees suffer too much of cache effects.
To see some real data, you can check the benchmark page of my TommyDS library http://tommyds.sourceforge.net/
Here you can see compared the performance of the most common hashtable, tree and trie libraries available.
One point to note is about the traversal, minimum and maximum item. Hash tables don’t support any kind of ordered traversal, or access to the minimum or maximum items. If these capabilities are important, the binary tree is a better choice.

What sort of sorted datastructure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range. (provided that it's balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
Assuming you mean by date when you say sorted, an array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date leaving you with an offset & count of items, or if you're going to process them anyway just iterate though the list until you exceed the end date.
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range. As Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection then most implementations of a binary tree would probably be out. Something like Java's TreeSet are set implementations and don't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most datastructures have trade-offs between lookups and updates. If you're doing lots of updates then some datastructure that are optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date, whenever you insert or remove a new one, and where finding the boundary for the segment of all objects later than or earlier than a given date is easy.
A heap seems the perfect candidate. In practical applications, heaps are simply represented by an array, where all the objects are stored in order. Seeing that sorted array as a heap is simply a way to make insertions of new objects and deletions happen in the right place, and in O(log(n)).
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or the insert position, that is, the position of the earlier element later than A), and the position of B (or the insert position of B), and return all the objects between those positions (which is simply the section between those positions in the array/heap)

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table or a prefix tree what are the discriminating factors that would lead me to choose one over the other. From my own naive point of view it seems as though using a trie has some extra overhead since it isn't stored as an array but that in terms of run time (assuming the longest key is the longest english word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest english word is 50 characters?
Hash tables are instant look up once you get the index. Hashing the key to get the index however seems like it could easily take near 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure (see comments below)
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
Everyone knows hash table and its uses but it is not exactly constant look up time , it depends on how big the hash table is , the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most of the industrial scenarios where even small latency/scalability matters (e.g.: high frequency trading). You have to care about the data structures to be optimized for space it takes up in memory too to reduce cache miss.
A very good example where trie better suits the requirements is messaging middleware . You have a million subscribers and publishers of messages to various categories (in JMS terms - Topics or exchanges) , in such cases if you want to filter out messages based on topics (which are actually strings) , you definitely do not want create hash table for the million subscriptions with million topics . A better approach is store the topics in trie , so when filtering is done based on topic match , its complexity is independent of number of topics/subscriptions/publishers (only depends on the length of string). I like it because you can be creative with this data structure to optimize space requirements and hence have lower cache miss.
Use a tree:
If you need auto complete feature
Find all words beginning with 'a' or 'axe' so on.
A suffix tree is a special form of a tree. Suffix trees have a whole list of advantages that hash cannot cover.
Insertion and lookup on a trie is linear with the lengh of the input string O(s).
A hash will give you a O(1) for lookup ans insertion, but first you have to calculate the hash based on the input string which again is O(s).
Conclussion, the asymptotic time complexity is linear in both cases.
The trie has some more overhead from data perspective, but you can choose a compressed trie which will put you again, more or less on a tie with the hash table.
To break the tie ask yourself this question: Do i need to lookup for full words only? Or do I need to return all words matching a prefix? (As in a predictive text input system ). For the first case, go for a hash. It is simpler and cleaner code. Easier to test and maintain. For a more ellaborated use case where prefixes or sufixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to a good use.
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
HashTable implementation is space efficient as compared to basic Trie implementation. But with strings, ordering is necessary in most of the practical applications. But HashTable totally disturbs the lexographical order. Now, if your application is doing operations based on lexographical order (like partial search, all strings with given prefix, all words in sorted order), you should use Tries. For only lookup, HashTable should be used (as arguably, it gives minimum lookup time).
P.S.: Other than these, Ternary Search Trees (TSTs) would be an excellent choice. Its lookup time is more than HashTable, but is time-efficient in all other operations. Also, its more space efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Resources