What is the proper data structure to store self-sorting list with repeating keys? - sorting

I need something that will work in O(log(n)) complexity, and I thought about AVL trees, but the problem is that some keys may repeat themselves (score of a person for example), so I can't think of how to implement it as a tree. What is a proper way to do this?

There are many options available. Most flavors of binary search trees can easily be modified to allow for nodes with duplicated values, since the balancing operations (usually) purely consist of rotations, which keep the sequence in order. For cases like these, you'd just do a normal BST insertion, but every time you see a duplicated value, you just arbitrarily move to the left or the right and continue as if the value were distinct.
Skiplists are particularly easy to update to support multiple copies of each key, since they don't do any complicated structural updates on insertions or deletions.
If you don't have auxiliary information associated with each key, then another simpler option would be to store a standard binary search tree, but to augment each node with a "count" field indicating how many logical copies of that key exist. Every time you do an insertion, if the key doesn't exist, you create it with count 1. If it already exists, you just increment the count in the existing node. Deletions would be implemented analogously: decrement the count, and remove the node only when the count reaches zero.
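As a rough illustration of the "count" idea, here is a minimal Python sketch. It uses a plain, unbalanced BST to stay short, so it does not give the O(log n) guarantee by itself; the same count field could be added to an AVL or red-black node. The names Node, insert, remove and in_order are made up for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1      # number of logical copies of this key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert one copy of key and return the (possibly new) subtree root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    else:
        root.count += 1     # duplicate: just bump the counter
    return root


def remove(root, key):
    """Remove one copy of key (if present) and return the subtree root."""
    if root is None:
        return None
    if key < root.key:
        root.left = remove(root.left, key)
    elif key > root.key:
        root.right = remove(root.right, key)
    elif root.count > 1:
        root.count -= 1     # other copies remain, keep the node
    else:
        # Last copy: unlink the node as in an ordinary BST deletion.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        succ = root.right   # in-order successor = leftmost node on the right
        while succ.left is not None:
            succ = succ.left
        root.key, root.count = succ.key, succ.count
        succ.count = 1      # force the recursive call to unlink the successor
        root.right = remove(root.right, succ.key)
    return root


def in_order(root):
    """Yield keys in sorted order, repeating each key `count` times."""
    if root is not None:
        yield from in_order(root.left)
        for _ in range(root.count):
            yield root.key
        yield from in_order(root.right)
```

Inserting the scores 70, 85, 70, 92, 85, 70 and iterating with in_order gives the scores back in sorted order with the repeats preserved, while the tree itself holds only three nodes.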
Of course, if you don't want to roll your own data structure, just go and find a good implementation of a multimap or multiset, which should get the job done for you quite nicely. Depending on your Programming Language of Choice, you might even find these in the standard libraries. :-)

Related

What are some good data structures to store a large orderbook?

I'm writing a Bitcoin trader app that is fetching orderbooks from exchanges. A typical orderbook looks like this: https://www.bitstamp.net/api/order_book/ (it has two parts, "bids" and "asks"; they should be stored separately but in identical data structures). One solution would be to store only part of this large orderbook. That would solve the access efficiency problem, but it introduces a set of other problems that have to do with consistency and update restrictions. So for now, it seems that a better solution is to fetch an orderbook and keep updating it.
Now this trader app later updates this fetched orderbook with new orders and with removed orders. For example, if you have an order at $900 to buy 1.5BTC in an orderbook, it may be cancelled completely or it may be updated to either contain more or less BTC. Also, a new order may be added below or above that price.
There are two critical operations:
1. quickly find an order with exactly the same price (in the case of an update or cancelling)
2. quickly find an order with the price closest to the one provided, but below it
In the case of an update we may not actually know it is an update, so we may start doing (2) and end up doing (1).
I'm not an expert in data structures, so I started looking through the most common ones, and for now I have a sense it should be some kind of tree, but I'm not sure which one. My most uneducated guess would be a data structure in which each node is a digit in a price, so, for example, to quickly find all nodes with the price of $900 we do items['9']['0'] and then look for leaf nodes. It's still a mess in my head for now, so please don't judge me too harshly. Any advice would be great.
It sounds like you want a simple binary search tree (BST): (well, a self-balancing one)
A binary search tree (BST) is a node-based binary tree data structure which has the following properties:
The left subtree of a node contains only nodes with keys less than the node's key.
The right subtree of a node contains only nodes with keys greater than the node's key.
The left and right subtree each must also be a binary search tree.
There must be no duplicate nodes (an easy constraint to get around though, if need be).
A BST allows you to efficiently do both of your operations - to find an element matching some value, or one where the value is closest, but smaller.
The running time of both of these operations is O(log n), and, more specifically, the number of comparisons is quite close to log2(n), which is around 12 for n = 5000, which is pretty much nothing (there's also a bit of work to rebalance the tree, but that should be a similar amount of work).
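To make the two lookups concrete, here is a small Python sketch. Python's standard library has no balanced BST, so this swaps in a sorted list searched with the bisect module: the searches are O(log n) like the tree, but insertion is O(n) because elements get shifted. The function names and the sample prices are invented for the example.

```python
import bisect


def find_exact(prices, price):
    """Return the index of price in the sorted list, or None if absent."""
    i = bisect.bisect_left(prices, price)
    if i < len(prices) and prices[i] == price:
        return i
    return None


def find_below(prices, price):
    """Return the largest stored price strictly below price, or None."""
    i = bisect.bisect_left(prices, price)   # first index with prices[i] >= price
    return prices[i - 1] if i > 0 else None


def add_price(prices, price):
    """Insert price unless it is already present, keeping the list sorted."""
    if find_exact(prices, price) is None:
        bisect.insort(prices, price)
```

With bids = [895.0, 899.5, 900.0, 905.0], find_exact(bids, 900.0) locates the existing order, and find_below(bids, 900.0) returns 899.5. Both start with the same binary search, which matches the observation that an update may begin as a "closest below" search and end as an exact match.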

Data structure for non overlapping ranges of integers?

I remember learning a data structure that stored a set of integers as ranges in a tree, but it's been 10 years and I can't remember the name of the data structure, and I'm a bit fuzzy on the details. If it helps, it's a functional data structure that was taught at CMU, I believe in 15-212 (Principles of Programming) in 2002.
Basically, I want to store a set of integers, most of which are consecutive. I want to be able to query for set membership efficiently, add a range of integers efficiently, and remove a range of integers efficiently. In particular, I don't care to preserve what the original ranges are. It's better if adjacent ranges are coalesced into a single larger range.
A naive implementation would be to simply use a generic set data structure such as a HashSet or TreeSet, and add all integers in a range when adding a range, or remove all integers in a range when removing a range. But of course, that would waste a lot of memory in addition to making add and remove slow.
I'm thinking of a purely functional data structure, but for my current use I don't need it to be. IIRC, lookup, insertion, and deletion were all O(log N), where N was the number of ranges in the set.
So, can you tell me the name of the data structure I'm trying to remember, or a suitable alternative?
I found the old homework, and the data structure I had in mind was the Discrete Interval Encoding Tree, or diet for short. Diets are described in detail in Diets for Fat Sets, Martin Erwig. Journal of Functional Programming, Vol. 8, No. 6, 627-632, 1998. A diet is basically a tree of intervals with the invariant that all of the intervals are non-overlapping and non-touching. There is a Haskell implementation in Hackage. I was hoping there would be an existing implementation for Scala, but I'm not seeing any.
The homework also included another data structure they called a Recursive Interval-Occluding Tree (RIOT), which rather than keeping only an interval at each node keeps an interval and another (possibly empty) RIOT of things removed from the interval. The assignment included benchmarks showing it did better than diets for random insertions and deletions. AFAICT it is simply something the TAs made up and never published as it no longer seems to exist anywhere on the Internets, at least not under that name.
You probably are looking for segment trees. This might be helpful: http://www.topcoder.com/tc?d1=tutorials&d2=lowestCommonAncestor&module=Static
You can also use binary search trees for the same, for which each node will have two data fields: min_val and max_val.
During insertion, you just need to run an extra merging step that checks whether the left child, parent, and right child now form one consecutive sequence, and if so clubs them into a single node. This takes O(log n) time.
Other operations like deletion and look-up take O(log n) time as usual, but deletion needs special care, since removing a range from the middle of a node splits that node in two.
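The merging and splitting rules can be sketched in Python as follows. To stay short, this keeps the (min_val, max_val) pairs in a sorted list rather than in tree nodes, so add and remove are O(n) here instead of O(log n); only the coalescing logic is the point. RangeSet and its methods are hypothetical names.

```python
import bisect


class RangeSet:
    """Integers stored as sorted, non-overlapping, non-touching (min_val, max_val) pairs."""

    def __init__(self):
        self.ranges = []   # inclusive on both ends

    def __contains__(self, x):
        # Binary search for the last range whose min_val is <= x.
        i = bisect.bisect_right(self.ranges, (x, float("inf"))) - 1
        return i >= 0 and self.ranges[i][1] >= x

    def add(self, lo, hi):
        kept = []
        for a, b in self.ranges:
            if b < lo - 1 or a > hi + 1:
                kept.append((a, b))                # neither overlapping nor adjacent
            else:
                lo, hi = min(lo, a), max(hi, b)    # absorb into the new range
        bisect.insort(kept, (lo, hi))
        self.ranges = kept

    def remove(self, lo, hi):
        kept = []
        for a, b in self.ranges:
            if b < lo or a > hi:
                kept.append((a, b))                # untouched by the removal
            else:
                if a < lo:
                    kept.append((a, lo - 1))       # piece left of the removed span
                if b > hi:
                    kept.append((hi + 1, b))       # piece right of the removed span
        self.ranges = kept
```

Adding 1..3, then 7..9, then 4..6 leaves the single range (1, 9); removing 4..5 afterwards splits it into (1, 3) and (6, 9).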

When to use a treap

Can anyone provide real examples of when a treap is the best way to store your data?
I want to understand in which situations a treap will be better than heaps and tree structures.
If it's possible, please provide some examples from real situations.
I've tried to search cases of using treaps here and by googling, but did not find anything.
Thank you.
If hash values are used as priorities, treaps provide a unique representation of the content.
Consider an ordered set of items implemented as an AVL tree or red-black tree. Inserting items in different orders will typically end up in trees with different shapes (although all of them are balanced). For a given content, a treap will always be of the same shape regardless of history.
I have seen two reasons for why unique representation could be useful:
Security reasons. A treap cannot contain information about its history.
Efficient subtree sharing. The fastest algorithms for set operations I have seen use treaps.
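The unique-shape property can be checked with a small Python sketch. It assumes priorities come from SHA-256 of the key's repr (any deterministic hash with distinct values per key would do) and uses a max-heap on those priorities; the function names are invented for the example.

```python
import hashlib


class Node:
    def __init__(self, key):
        self.key = key
        # Priority depends on the key alone, never on insertion history.
        self.priority = hashlib.sha256(repr(key).encode()).hexdigest()
        self.left = None
        self.right = None


def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l


def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r


def insert(root, key):
    """Ordinary BST insertion, then rotate up while the heap order is violated."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.priority > root.priority:
            root = rotate_right(root)
    elif key > root.key:
        root.right = insert(root.right, key)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


def shape(root):
    """Describe the tree as nested tuples: (left, key, right)."""
    if root is None:
        return None
    return (shape(root.left), root.key, shape(root.right))
```

Comparing shape(build([1, 2, 3, 4, 5, 6, 7])) with shape(build([7, 3, 5, 1, 6, 2, 4])) gives equal results, whereas an AVL or red-black tree built in those two orders would generally differ.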
I cannot provide you with any real-world examples, but I do use treaps to solve some problems in programming contests:
http://poj.org/problem?id=2761
http://poj.org/problem?id=3481
These are not actually real problems, but they make sense.
You can use it as a tree-based map implementation. Depending on the application, it could be faster. A couple of years ago I implemented a Treap and a Skip list myself (in Java) just for fun and did some basic benchmarking comparing them to TreeMap, and the Treap was the fastest. You can see the results here.
One of its greatest advantages is that it's very easy to implement, compared to Red-Black trees, for example. However, as far as I remember, it doesn't have a guaranteed cost in its operations (search is O(log n) with high probability), in comparison to Red-Black trees, which means that you wouldn't be able to use it in safety-critical applications where a specific time bound is a requirement.
Treaps are an awesome variant of the balanced binary search tree. There are many algorithms for balancing binary trees, but most of them are horrible things with tons of special cases to handle. Treaps, on the other hand, are very easy to code. By making some use of randomness, we get a balanced binary tree that is expected to be of logarithmic height.
Some good problems to solve using treaps are --
http://www.spoj.com/problems/QMAX3VN/ ( Easy level )
http://www.spoj.com/problems/GSS6/ ( Moderate level )
Let's say you have a company and you want to create an inventory tool:
Be able to (efficiently) search products by name so you can update the stock.
Get, at any time, the product with the fewest items in stock, so that you are able to plan your next order.
One way to handle these requirements could be to use two different data structures: one for efficient search by name, for instance a hash table, and a priority queue to get the item that most urgently needs to be resupplied. You then have to coordinate those two data structures, and you will need more than twice the memory. A single structure ordered by one criterion does not help either: if we sort the list of entries by name, we need to scan the whole list to find a given value for the other criterion, in this case the quantity in stock. Likewise, if we use a min-heap with the scarcer products at its top, we need linear time to scan the whole heap looking for a product to update.
Treap
A treap is a blend of tree and heap. The idea is to enforce a BST's constraints on the names, and a heap's constraints on the quantities. Product names are treated as the keys of a binary search tree. The inventory quantities, instead, are treated as the priorities of a heap, so they define a partial ordering from top to bottom. As in all heaps, the ordering on priorities is only partial, meaning that only nodes on the same path from the root to a leaf are ordered with respect to their priority. In such a treap, child nodes always have a stock count at least as high as their parents', but there is no ordering between siblings.
Any subtree of a treap is also a treap (i.e. it satisfies the BST rule as well as the min- or max-heap rule). Thanks to this property, an ordered list can be split, or multiple ordered lists can be merged, more easily with treaps than with a red-black tree. The implementation is easier, and so is the design.
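A minimal Python sketch of split and merge, assuming random priorities and a max-heap on them. Note that merge here joins two treaps where every key in the first is smaller than every key in the second, which is the case that arises after a split; the names are chosen for the example.

```python
import random


class Node:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()
        self.left = None
        self.right = None


def split(root, key):
    """Split into two treaps: (keys < key, keys >= key)."""
    if root is None:
        return None, None
    if root.key < key:
        left, right = split(root.right, key)
        root.right = left          # root and its left subtree stay on the small side
        return root, right
    left, right = split(root.left, key)
    root.left = right              # root and its right subtree stay on the large side
    return left, root


def merge(a, b):
    """Join treaps a and b, assuming all keys in a are smaller than all keys in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.priority > b.priority:    # higher priority becomes the root
        a.right = merge(a.right, b)
        return a
    b.left = merge(a, b.left)
    return b


def insert(root, key):
    """Insertion falls out of split and merge; no rotations are needed."""
    left, right = split(root, key)
    return merge(merge(left, Node(key)), right)


def keys(root):
    """In-order list of keys."""
    return [] if root is None else keys(root.left) + [root.key] + keys(root.right)
```

Each of split and merge walks a single root-to-leaf path, so both take expected O(log n) time. Because every subtree is itself a treap, the recursive calls need no extra fix-up step.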

Best Data Structure to Store Large Amounts of Data with Dynamic and Non-unique Keys?

Basically, I have a large number of C structs to keep track of, that are essentially:
struct Data {
    int key;
    ... // More data
};
I need to periodically access lots (hundreds) of these, and they must be sorted from lowest to highest key values. The keys are not unique and they will be changed over the course of the program. To make matters even more interesting, the majority of the structures will be culled (based on criteria completely unrelated to the key values) from the pool right before being sorted, but I still need to keep references to them.
I've looked into using a binary search tree to store them, but the keys are not guaranteed to be unique and I'm not entirely sure how to restructure the tree once a key is changed or how to cull specific structures.
To recap in case that was unclear above, I need to:
Store a large number of structures with non-unique and dynamic keys.
Cull a large percentage of the structures (but not free them entirely because different structures are culled each time).
Sort the remaining structures from lowest to highest key value.
What data structure/algorithms would you use to solve this problem? The method needs to be as fast and/or memory efficient as possible, since this is a real-time application.
EDIT: The culling is done by iterating over all of the objects and making a decision for each one. The keys change between the culling/sorting runs. I should have stated that they don't change a lot, but they do change, and they can change multiple times between the culling/sorting runs. (If it helps, the key for each structure is actually a z-order for a Sprite. They need to be sorted before each drawing loop so the Sprites with lower z-orders are drawn first.)
Just stick 'em all in a big array.
When the time comes to do the cull and sort, start by doing the sort. Do an insertion sort. That's right - nothing clever, just an insertion sort.
After the sort, go through the sorted array, and for each object, make the culling decision, then immediately output the object if it isn't culled.
This is about as memory-efficient as it gets. It should also require very little computation: there's no bookkeeping on updates between cull/sort passes, and the sort will be cheap, because insertion sort is adaptive, and for an almost-sorted array like this it will be almost O(n). The one thing it doesn't do well is cache locality: there will be two separate passes over the array, one for the sort and one for the cull/output.
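A short Python sketch of the sort-then-cull pass, for illustration only (the real application would be in C). Sprites are modelled as dicts with a "z" field, and is_culled is a hypothetical predicate standing in for whatever culling test is used.

```python
def insertion_sort(items, key):
    """Sort items in place; close to O(n) when items is already almost sorted."""
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        # Shift larger elements one slot right until current fits.
        while j >= 0 and key(items[j]) > key(current):
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current


def visible_in_order(sprites, is_culled):
    """First pass sorts everything by z-order; second pass skips culled sprites."""
    insertion_sort(sprites, key=lambda s: s["z"])
    return [s for s in sprites if not is_culled(s)]
```

Because the array stays sorted between frames, a sprite whose z-order changed slightly only moves a few slots, which is what keeps the inner loop short. The sort is also stable, so sprites with equal keys keep their relative order.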
If you demand more cleverness, then instead of an insertion sort, you could use another adaptive, in-place sort that's faster. Timsort and smoothsort are good candidates; both are utterly fiendish to implement.
The big alternative to this is to only sort unculled objects, using a secondary, temporary list of such objects which you sort (or keep in a binary tree or whatever). But the thing is, if the keys don't change that much, then the win you get from using an adaptive sort on an almost-sorted array will (I reckon!) outweigh the win you would get from sorting a smaller dataset. It's O(n) vs O(n log n).
The general solution to this type of problem is to use a balanced search tree (e.g. AVL tree, red-black tree, B-tree), which guarantees O(log n) time (almost constant, but not quite) for insertion, deletion, and lookup, where n is the number of items currently stored in the tree. Many implementations automatically guarantee that no key is stored in the tree twice; since your keys are not unique, you would need a variant that allows duplicates.
If you're working in C++, you could try using std::multimap<int, yourtype> (std::map<int, yourtype> keeps only one entry per key). If in C, find or implement some simple binary search tree code, and see if it's fast enough.
However, if you use such a tree and find it's too slow, you could look into some more fine-tuned approaches. One might be to put your structs in one big array, radix sort by the integer key, cull on it, then re-sort per pass. Another approach might be to use a Patricia tree.

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table and a prefix tree, what are the discriminating factors that would lead me to choose one over the other? From my own naive point of view, it seems as though a trie has some extra overhead since it isn't stored as an array, but in terms of run time (assuming the longest key is the longest English word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest English word is 50 characters?
Hash tables give instant lookup once you have the index. Hashing the key to get the index, however, seems like it could easily take nearly 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure (see comments below)
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
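To show the kind of prefix query a hash table does not offer directly, here is a minimal trie sketch in Python. It uses a dict of children per node for brevity (so it still hashes single characters internally); the class and method names are invented for the example.

```python
class Trie:
    def __init__(self):
        self.children = {}    # maps one character to a child Trie
        self.is_key = False   # True if a stored key ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_key = True

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None   # stops early, before reading the whole key
        return node

    def __contains__(self, word):
        node = self._find(word)
        return node is not None and node.is_key

    def with_prefix(self, prefix):
        """Return all stored keys that start with prefix, in sorted order."""
        results = []

        def walk(node, path):
            if node.is_key:
                results.append(path)
            for ch in sorted(node.children):
                walk(node.children[ch], path + ch)

        start = self._find(prefix)
        if start is not None:
            walk(start, prefix)
        return results
```

After inserting "axe", "axle", "apple" and "bat", with_prefix("ax") returns the two matching words without touching "apple" or "bat". A plain hash table would have to scan every key to answer the same query.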
Everyone knows the hash table and its uses, but its lookup time is not exactly constant: it depends on how big the hash table is and on the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most industrial scenarios where even small latency and scalability differences matter (e.g. high-frequency trading). You also have to care about how much space the data structure takes up in memory, to reduce cache misses.
A very good example where a trie better suits the requirements is messaging middleware. You have a million subscribers and publishers of messages to various categories (in JMS terms, topics or exchanges). In such cases, if you want to filter out messages based on topics (which are actually strings), you definitely do not want to create a hash table for the million subscriptions with a million topics. A better approach is to store the topics in a trie, so that when filtering is done based on topic match, its complexity is independent of the number of topics/subscriptions/publishers (it only depends on the length of the string). I like it because you can be creative with this data structure to optimize space requirements and hence have fewer cache misses.
Use a trie:
If you need an autocomplete feature
If you need to find all words beginning with 'a' or 'axe' and so on.
A suffix tree is a special form of a trie. Suffix trees have a whole list of advantages that a hash cannot cover.
Insertion and lookup on a trie are linear in the length of the input string, O(s).
A hash will give you O(1) for lookup and insertion, but first you have to calculate the hash based on the input string, which again is O(s).
Conclusion: the asymptotic time complexity is linear in both cases.
The trie has some more overhead from a data perspective, but you can choose a compressed trie, which will put you again more or less on a tie with the hash table.
To break the tie, ask yourself this question: Do I need to look up full words only? Or do I need to return all words matching a prefix (as in a predictive text input system)? For the first case, go for a hash. It is simpler and cleaner code, and easier to test and maintain. For a more elaborate use case where prefixes or suffixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to a good use.
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
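To make the "hashing reads every byte" point concrete, here is a sketch of one common string hash, 32-bit FNV-1a, in Python. It is chosen here only as a simple illustration; the answer above does not name a particular hash function.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: one XOR and one multiply per input byte."""
    h = 0x811C9DC5                           # FNV offset basis (32-bit)
    for byte in data:                        # the loop runs len(data) times
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF    # FNV prime, wrapped to 32 bits
    return h
```

Hashing b"farm animals" (12 bytes) runs the loop three times as often as hashing b"farm" (4 bytes), so the "O(1)" hash table lookup still costs O(k) in the key length, just like walking a trie.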
A HashTable implementation is space-efficient compared to a basic trie implementation. But with strings, ordering is necessary in most practical applications, and a HashTable totally disturbs the lexicographical order. So if your application is doing operations based on lexicographical order (like partial search, all strings with a given prefix, or all words in sorted order), you should use tries. For lookup only, a HashTable should be used (as, arguably, it gives the minimum lookup time).
P.S.: Other than these, Ternary Search Trees (TSTs) would be an excellent choice. Their lookup time is higher than a HashTable's, but they are time-efficient in all other operations. Also, they are more space-efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Resources