Pattern for Frequently Updated Sorted Data

Let's say you're building a blogging website. It displays recent blog posts by multiple authors, sorted by "priority", with the highest priority on top. Priority is determined by some formula involving:
how recently the post was published
how many comments it attracted
The order must always be accurate in real time.
Sorting by priority is easy. The problem is that the website is hugely popular, and comments fly in at a rate of hundreds per minute, across dozens of posts.
Is there a pattern to handle this scenario? In other words, can we do any better than just updating the priority field whenever there's a comment on a post, and then sorting posts each and every time the page is loaded? Caching post order doesn't help much because heavy user activity causes order to change frequently.
With "pattern" I'm speaking from both a code and database schema point of view.

You can use a balanced binary tree (e.g. a red-black tree) to store the sorted index, which should make it quicker to update than if you were sorting the entire index every time.
Using Java-ish pseudocode this would look like:

Tree tree;

class Node {
    int priority;

    void incrementPriority() {
        Node next = tree.nextHighestNode(this);
        if (next != null && priority + 1 > next.priority) {
            // Out of place: remove while the old priority is still the key,
            // then re-insert so the tree rebalances itself around the new value.
            tree.remove(this);
            priority = priority + 1;
            tree.add(this);
        } else {
            priority = priority + 1; // still in order, no restructuring needed
        }
    }

    void decrementPriority() {
        Node prev = tree.nextLowestNode(this);
        if (prev != null && priority - 1 < prev.priority) {
            tree.remove(this);
            priority = priority - 1;
            tree.add(this);
        } else {
            priority = priority - 1;
        }
    }
}
If changing a node's priority means that it's in an invalid tree location (that is, its priority is now higher than that of what ought to be the next-highest node, or lower than that of what ought to be the next-lowest node), then it's removed and re-added to the tree (which takes care of rebalancing itself). Insertion is O(log(n)), but usually (when there are no insertions/removals) updating the priority is a constant-time operation.
Red-black trees are the usual implementation of balanced binary trees, but there are alternatives; a Tango tree may be more appropriate here, since it's an online structure. The biggest problem is going to be concurrency. Ideally you would implement the nodes' priority fields using something like an AtomicInteger (which permits atomic increments and decrements; quite a few languages have an equivalent) so that you won't need to lock the field each time you change it, but it will still be difficult to atomically compare a priority against the adjacent nodes' priorities.
As an alternative, you can store everything in an array or a linked list and swap adjacent elements when their priorities change. This way you won't need to do a full sort each time, and unlike the balanced binary tree, where removing and inserting an element is O(log(n)), swapping two adjacent array/list elements is constant time. The only problem is that adding an entirely new element is costly: with an array you will need to shift all of the following elements, and it will be O(n) with a list as well, because you'll need to traverse the list until you find the correct location to insert the item. The list is probably still preferable, because you won't need to shift any adjacent elements (which will reduce the amount of locking you need to do).
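Here is a minimal sketch of that adjacent-swap approach on a doubly linked list (the class and field names are illustrative, and concurrency is ignored):

class PostNode {
    int priority;
    PostNode prev, next; // prev points toward the higher-priority end

    void incrementPriority() {
        priority = priority + 1;
        // Bubble toward the head while we now outrank our predecessor;
        // each individual swap is a constant-time pointer update.
        while (prev != null && priority > prev.priority) {
            swapWithPrev();
        }
    }

    private void swapWithPrev() {
        PostNode p = prev;        // the node currently ahead of us
        PostNode before = p.prev; // null if p is the head of the list
        PostNode after = next;    // null if we are the tail

        if (before != null) before.next = this;
        prev = before;
        next = p;
        p.prev = this;
        p.next = after;
        if (after != null) after.prev = p;
        // A real list would also update its head reference when before == null.
    }
}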

Related

Data structure / algorithm to find current position of an identifier in array following many insertions

Summary
I have a problem that seems likely to have many good known solutions with different performance characteristics. Assuming this is the case, a great answer could include code (or pseudocode), but ideally it would reference the literature and give the name of this problem, so I can explore it in greater detail and understand the space of solutions.
The problem
I have an initially empty array that can hold identifiers. Any identifiers in the array are unique (they're only ever in the array once).
var identifiers: [Identifier] = []
Over time a process will insert identifiers into the array. It won't just append them to the end; they can be inserted anywhere in the array.
Identifiers will never be removed from the array – only inserted.
Insertion needs to be quick, so probably the structure won't literally be an array, but something supporting better-than-linear-time insertion (such as a B-tree).
After some identifiers have been added to identifiers, I want to be able to query the current position of any given identifier. So I need a function to look this up.
A linear time solution to this is simply to scan through identifiers from the start until an identifier is found, and the answer is the index that was reached.
func find(identifier: Identifier) -> Int? {
    for index in identifiers.indices {
        if identifiers[index] == identifier {
            return index
        }
    }
    return nil
}
But this linear scaling with the size of the array is problematic if the array is very large (perhaps 100s of millions of elements).
A hash map doesn't work
We can't put the positions of the identifiers into a hash map. This doesn't work because identifier positions are not fixed after insertion into the array. If other identifiers are inserted before them, they will drift to higher indexes.
However, a possible acceleration for the linear-time algorithm would be to cache the initial insertion position of an identifier and begin a linear scan from there. Because identifiers are only ever inserted, the identifier must be at that index or a later one (or not in identifiers at all). Once the identifier is found, the cache can be updated; a sketch of this approach is shown below.
Another option could be to update the hash map after any insertion to correct the stored positions. However, this would slow insertion down to a potentially linear-time operation (as previously mentioned, identifiers is probably not literally an array but some other structure allowing better-than-linear-time insertion).
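A minimal sketch of that cached-start scan, written in Java for concreteness (the backing store is a plain ArrayList here, so insertion itself is still linear; the point is only the lookup acceleration):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

class PositionIndex<T> {
    private final List<T> identifiers = new ArrayList<>();
    private final Map<T, Integer> lastKnown = new HashMap<>();

    void insert(int index, T id) {
        identifiers.add(index, id); // O(n) in an ArrayList
        lastKnown.put(id, index);   // positions can only drift rightwards later
    }

    OptionalInt find(T id) {
        // Start from the cached position; the identifier can never be earlier.
        for (int i = lastKnown.getOrDefault(id, 0); i < identifiers.size(); i++) {
            if (identifiers.get(i).equals(id)) {
                lastKnown.put(id, i); // refresh the cache
                return OptionalInt.of(i);
            }
        }
        return OptionalInt.empty();
    }
}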
Summary
There's a linear time solution, and there's an optimisation using a hash map (at the cost of roughly doubling storage). Is there a much better solution for looking up the current index of an identifier, perhaps in log time?
You can use an order-statistic tree for this, based on a red-black tree or other self-balancing binary tree. Essentially, each node will have an extra integer field, storing the number of descendants it currently has. (Insertion and deletion operations, and their resultant rotations, will only result in updating O(log n) nodes so this can be done efficiently). To query the position of an identifier, you examine the descendant count of its left subtree and the descendant counts of the siblings of each of its right-side ancestors.
Note that while the classic use of an order-statistic tree is for a sorted list, nothing in the logic requires this; "node A is before node B" is all you need for insertion, tree rotations don't require re-evaluating the ordering, and having a pointer to the node (and having nodes store parent pointers) is all you need to query the index. All operations are in O(log n).
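To make the rank query concrete, here is a minimal Java sketch of just the position lookup (names are illustrative), assuming each node already maintains a correct subtree-size field and a parent pointer:

class OSNode {
    OSNode left, right, parent;
    int size = 1; // number of nodes in this subtree, including this one

    static int size(OSNode n) {
        return n == null ? 0 : n.size;
    }

    // 0-based position of this node in the in-order sequence.
    // O(log n) in a balanced tree: one step per level.
    int rank() {
        int r = size(left); // everything in our left subtree precedes us
        OSNode node = this;
        while (node.parent != null) {
            if (node == node.parent.right) {
                // the parent and its entire left subtree also precede us
                r += size(node.parent.left) + 1;
            }
            node = node.parent;
        }
        return r;
    }
}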

What are some good data structures to store a large orderbook?

I'm writing a Bitcoin trader app that is fetching orderbooks from exchanges. A typical orderbook looks like this: https://www.bitstamp.net/api/order_book/ (it has two parts, "bids" and "asks", which should be stored separately but in identical data structures). One solution would be to store only part of this large orderbook, which would solve the access-efficiency problem, but it introduces a set of other problems to do with consistency and update restrictions. So for now, it seems that a better solution is to fetch an orderbook and keep updating it.
Now this trader app later updates this fetched orderbook with new orders and with removed orders. For example, if you have an order at $900 to buy 1.5BTC in an orderbook, it may be cancelled completely or it may be updated to either contain more or less BTC. Also, a new order may be added below or above that price.
There are two critical operations:
1. quickly find an order with exactly the same price (in the case of an update or a cancellation)
2. quickly find an order with the price closest to the one provided, but below it
In the case of an update we may not actually know it is an update, so we may start doing (2) and end up doing (1).
I'm not an expert in data structures, so I started looking through the most common ones, and for now I have a sense it should be some kind of tree, but I'm not sure which one. My most uneducated guess would be a data structure in which each node is a digit of a price, so that, for example, to quickly find all nodes with a price of $900 we do items['9']['0'] and then look for leaf nodes. It's still a mess in my head for now, so please don't judge me too harshly. Any advice would be great.
It sounds like you want a simple binary search tree (BST), or rather a self-balancing one:
A binary search tree (BST) is a node-based binary tree data structure which has the following properties:
The left subtree of a node contains only nodes with keys less than the node's key.
The right subtree of a node contains only nodes with keys greater than the node's key.
The left and right subtree each must also be a binary search tree.
There must be no duplicate nodes (an easy constraint to get around though, if need be).
A BST allows you to efficiently do both of your operations - to find an element matching some value, or one where the value is closest, but smaller.
The running time of both of these operations is O(log n); more specifically, the number of comparisons is quite close to log2(n), which is around 12 for n = 5000, i.e. pretty much nothing (there is also a bit of work to rebalance the tree, but that should be a similar amount).
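For example, in Java you could lean on TreeMap (a red-black tree under the hood) for both operations; the prices and amounts below are purely illustrative:

import java.math.BigDecimal;
import java.util.Map;
import java.util.TreeMap;

class OrderBookSide {
    public static void main(String[] args) {
        // price -> total amount at that price (keep one map for bids, one for asks)
        TreeMap<BigDecimal, Double> bids = new TreeMap<>();
        bids.put(new BigDecimal("900.00"), 1.5);
        bids.put(new BigDecimal("899.00"), 0.7);

        // (1) exact-price lookup, e.g. for an update or a cancellation
        Double atPrice = bids.get(new BigDecimal("900.00"));

        // (2) closest price strictly below the one provided
        Map.Entry<BigDecimal, Double> below = bids.lowerEntry(new BigDecimal("900.00"));

        System.out.println(atPrice + " BTC at 900.00; next level down: " + below);
    }
}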

Finding a proper data structure in C++

I was looking for a simply implemented data structure that fulfils my needs in the least possible time (in the worst possible case):
(1) pop the nth element (I have to keep the relative order of elements intact)
(2) access the nth element.
I couldn't use an array because it can't pop, and I don't want to leave a gap after deleting the ith element. I tried to remove the gap by exchanging the nth element with the next, and that one with its next, and so on until the last, but that proved time-inefficient, though an array's O(1) access is unbeatable.
I tried using a vector, with erase for the pop and .at() for access, but even that is not cheap time-wise, though it's better than the array.
What you can try is a skip list: it supports the operations you are requesting in O(log(n)). Another option would be a tiered vector, which is just slightly easier to implement and takes O(sqrt(n)). Both structures are quite cool, but alas not very popular.
A tiered vector implemented on an array would, I think, best fit your purpose. The tiered-vector concept may be new to you and a little tricky to understand at first, but once you get it, it gives you a handy weapon for tackling the data-structure part of many problems very efficiently. So it is worth mastering the tiered vector's implementation.
An array will give you O(1) lookup but O(n) delete of the element.
A list will give you O(n) lookup but O(1) delete of the element.
A binary search tree will give you O(log n) lookup with O(1) delete of the element. But it doesn't preserve the relative order.
A binary search tree used in conjunction with the list will give you the best of both worlds. Insert a node into both the list (to preserve order) and the tree (fast lookup). Delete will be O(1).
struct node {
    node* list_next;   // next element in insertion order
    node* list_prev;   // previous element in insertion order
    node* tree_right;  // BST child with a greater key
    node* tree_left;   // BST child with a lesser key
    // node data;
};
Note that if the nodes are inserted into the tree using the index as the sort value, you will end up with another linked list pretending to be a tree. The tree can, however, be balanced in O(n) time once it is built, a cost you would only have to incur once.
Update
Thinking about this more, this might not be the best approach for you. I'm used to doing lookups on the data itself, not on its relative position in a set; this is a data-centric approach. Using the index as the sort value will break as soon as you remove a node, since the "higher" indices will need to change.
Warning: Don't take this answer seriously.
In theory, you can do both in O(1), assuming these are the only operations you want to optimize for. The following solution will need lots of space (and it will leak space), and it will take a long time to create the data structure:
Use an array. In every entry of the array, point to another array which is the same, but with that entry removed.

Binary Search Tree for specific intent

We all know there are plenty of self-balancing binary search trees (BSTs), the most famous being the red-black tree and the AVL tree. It might be useful to take a look at AA trees and scapegoat trees too.
I want to do deletions, insertions, and searches, like in any other BST. However, it will be common to delete all values in a given range, or to delete whole subtrees. So:
I want to insert, search, remove values in O(log n) (balanced tree).
I would like to delete a subtree, keeping the whole tree balanced, in O(log n) (worst-case or amortized)
It might be useful to delete several values in a row, before balancing the tree
I will most often insert 2 values at once; however, this is not a rule (just a tip in case there is a tree data structure that takes this into account)
Is there a variant of AVL or RB trees that helps me with this? Scapegoat trees look closest, but they would also need some changes; can anyone with experience of them share some thoughts?
More precisely, which balancing procedure and/or removal procedure would help me keep these actions time-efficient?
It is possible to delete a range of values from a BST in O(log n + number of objects in the range).
The easiest way I know is to work with the deterministic skip list data structure (you might want to read a bit about this data structure before you go on).
In a deterministic skip list, all of the real values are stored in the bottom level, with pointers to them on the upper levels. Insert, search, and remove are done in O(log n).
The range-deletion operation can be done according to the following algorithm:
Find the first element in the range - O(log n)
Go forward in the linked list, removing all elements that are still in the range; if an element has pointers on upper levels, remove those too, up to the topmost level (removal from a linked list) - O(number of deleted objects)
Fix the pointers to restore the deterministic skip list's invariant (2-3 elements between every upward pointer)
The total complexity of the range deletion is O(log n + number of objects in the range).
Notice that if you choose to work with a randomized skip list, you get the same complexity, but on average rather than worst case. The plus is that you don't have to fix the upper-level pointers to meet the 2-3 demand.
A deterministic skip list has a 1-1 mapping to a 2-3 tree, so with some more work, the procedure described above could work for a 2-3 tree as well.
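If you want the behaviour without the worst-case guarantee, note that Java's ConcurrentSkipListMap (a randomized skip list) already exposes range deletion through a subrange view; a small sketch with illustrative keys:

import java.util.concurrent.ConcurrentSkipListMap;

class RangeDeleteDemo {
    public static void main(String[] args) {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        for (int i = 0; i < 100; i++) {
            map.put(i, "value-" + i);
        }

        // Delete every key in [25, 75]: clearing the view removes the entries
        // from the backing map, one O(log n) removal per deleted element.
        map.subMap(25, true, 75, true).clear();

        System.out.println(map.size()); // 49 entries remain
    }
}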
Long ago, in the pre-STL days, I wrote my own B-tree algorithm because I had a rather large data set at the time (roughly 700K items in 2 trees that were interdependent). I found that rebalancing after every 100-200 insertions/deletions gave the peak performance I could get at the time, based on experimentation on 486 and SGI hardware. This number may be different now, or maybe not, since it does appear to be an algorithmic optimization limit unless you convert to a parallel model.
In short, you could apply a modification trigger for the rebalancing, and allow for forced rebalancing when you've completed all your modifications.
The improvement was remarkable. The initial straight load was not complete after 25 minutes (I killed the process). Rebalancing as we went was also killed, after 15 minutes. The restricted modification load with a rebalance every 100 mods loaded and ran in less than 3 minutes. Note that during the "run" portion, there were 0-8 modifications to the tree per initial entry. You really need to consider whether you always need to be in balance when the tree will be modified again in the near term.
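The modification trigger described above might look something like the following hypothetical Java sketch (the RebalancingTree interface and the threshold of 100 are assumptions for illustration, not the original code):

interface RebalancingTree<T> {
    void insert(T item);
    void delete(T item);
    void rebalance(); // an explicit, full O(n) rebuild-to-balanced pass
}

class BatchedTree<T> {
    private final RebalancingTree<T> tree;
    private final int threshold;
    private int modsSinceRebalance = 0;

    BatchedTree(RebalancingTree<T> tree, int threshold) {
        this.tree = tree;           // e.g. threshold = 100, per the experiment above
        this.threshold = threshold;
    }

    void insert(T item) { tree.insert(item); noteModification(); }
    void delete(T item) { tree.delete(item); noteModification(); }

    // Force balance after a bulk load, before heavy read traffic starts.
    void flush() {
        tree.rebalance();
        modsSinceRebalance = 0;
    }

    private void noteModification() {
        if (++modsSinceRebalance >= threshold) {
            flush();
        }
    }
}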
Hmm, what about B-trees? They are also balanced, and if you choose a big-order one (it depends on how many items you have), you will save a bunch of object creation/destruction time.
Regarding point 2: if you have a B-tree of order 100, you can remove up to 100 items with one function call.
Regarding point 3: this feature can be applied to almost any tree; just implement a RemoveSome() function that removes N items and does a rebalance. For B-trees it's a bit trickier, but it can be done.
Note: I assume you're a programmer. If you need a complete, tested, off-the-shelf solution, you need another answer.
It should be easy to implement deletion of a node and its subnodes in an AVL tree if every node stores its height instead of a balance factor. After deleting a node, keep rotating until the two child nodes differ in height by no more than one; then move up the tree and repeat. The only real difference from a normal deletion is a while instead of an if when testing the heights.
The Set implementation in the OCaml standard library is a purely functional AVL tree that satisfies all of your requirements and, in particular, has very efficient implementations of set theoretic operations (union, intersection, difference). Insertion and deletion are O(log n). You can remove subtrees and runs of elements by representing them as a set and using set difference. You can insert two elements simultaneously by creating a 2-element set and applying set union.

What sort of sorted data structure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range (provided that the tree is balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
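In Java, for instance, TreeSet's subSet view gives exactly that lower-bound-then-iterate behaviour (the integer timestamps here are just for illustration):

import java.util.NavigableSet;
import java.util.TreeSet;

class RangeQueryDemo {
    public static void main(String[] args) {
        NavigableSet<Integer> dates = new TreeSet<>();
        dates.add(20240101);
        dates.add(20240215);
        dates.add(20240320);

        // All elements in [20240101, 20240301): O(log n) to find the lower
        // bound, then iteration touches only the matching elements.
        for (int d : dates.subSet(20240101, true, 20240301, false)) {
            System.out.println(d);
        }
    }
}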
Assuming that by sorted you mean sorted by date, an array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date, leaving you with an offset and count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
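A minimal sketch of that first binary search (in Java, with dates stored as sorted long values for illustration):

class DateRangeScan {
    // First index whose value is >= key (returns a.length if there is none).
    static int lowerBound(long[] a, long key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[] dates = {3, 7, 9, 14, 20, 25};
        long start = 8, end = 20;

        // Walk forward from the lower bound until we pass the end date.
        for (int i = lowerBound(dates, start); i < dates.length && dates[i] <= end; i++) {
            System.out.println(dates[i]); // prints 9, 14, 20
        }
    }
}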
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range. As Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection, then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most data structures have trade-offs between lookups and updates. If you're doing lots of updates, then a data structure that is optimized for lookups won't be so great.
So: what are the access characteristics of the data structure, what kind of performance do you need, and what structural characteristics must it support (e.g. must it allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date, whenever you insert or remove a new one, and where finding the boundary for the segment of all objects later than or earlier than a given date is easy.
A heap seems the perfect candidate. In practical applications, heaps are simply represented by an array, where all the objects are stored in order. Seeing that sorted array as a heap is simply a way to make insertions of new objects and deletions happen in the right place, and in O(log(n)).
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or the insertion position, that is, the position of the earliest element later than A) and the position of B (or the insertion position of B), and return all the objects between those positions (which is simply the section between those positions in the array/heap).
