Data Structure for fast position lookup - data-structures

Looking for a datastructure that logically represents a sequence of elements keyed by unique ids (for the purpose of simplicity let's consider them to be strings, or at least hashable objects). Each element can appear only once, there are no gaps, and the first position is 0.
The following operations should be supported (demonstrated with single-letter strings):
insert(id, position) - add the element keyed by id into the sequence at offset position. Naturally, the position of each element later in the sequence is now incremented by one. Example: [S E L F].insert(H, 1) -> [S H E L F]
remove(position) - remove the element at offset position. Decrements the position of each element later in the sequence by one. Example: [S H E L F].remove(2) -> [S H L F]
lookup(id) - find the position of element keyed by id. [S H L F].lookup(H) -> 1
The naïve implementation would be either a linked list or an array. Both would give O(n) lookup, remove, and insert.
In practice, lookup is likely to be used the most, with insert and remove happening frequently enough that it would be nice not to be linear (which a simple combination of hashmap + array/list would get you).
In a perfect world it would be O(1) lookup, O(log n) insert/remove, but I actually suspect that wouldn't work from a purely information-theoretic perspective (though I haven't tried it), so O(log n) lookup would still be nice.

A combination of trie and hash map allows O(log n) lookup/insert/remove.
Each node of trie contains id as well as counter of valid elements, rooted by this node and up to two child pointers. A bit string, determined by left (0) or right (1) turns while traversing the trie from its root to given node, is part of the value, stored in the hash map for corresponding id.
Remove operation marks trie node as invalid and updates all counters of valid elements on the path from deleted node to the root. Also it deletes corresponding hash map entry.
Insert operation should use the position parameter and counters of valid elements in each trie node to search for new node's predecessor and successor nodes. If in-order traversal from predecessor to successor contains any deleted nodes, choose one with lowest rank and reuse it. Otherwise choose either predecessor or successor, and add a new child node to it (right child for predecessor or left one for successor). Then update all counters of valid elements on the path from this node to the root and add corresponding hash map entry.
Lookup operation gets a bit string from the hash map and uses it to go from trie root to corresponding node while summing all the counters of valid elements to the left of this path.
All this allow O(log n) expected time for each operation if the sequence of inserts/removes is random enough. If not, the worst case complexity of each operation is O(n). To get it back to O(log n) amortized complexity, watch for sparsity and balancing factors of the tree and if there are too many deleted nodes, re-create a new perfectly balanced and dense tree; if the tree is too imbalanced, rebuild the most imbalanced subtree.
Instead of hash map it is possible to use some binary search tree or any dictionary data structure. Instead of bit string, used to identify path in the trie, hash map may store pointer to corresponding node in trie.
Other alternative to using trie in this data structure is Indexable skiplist.
O(log N) time for each operation is acceptable, but not perfect. It is possible, as explained by Kevin, to use an algorithm with O(1) lookup complexity in exchange for larger complexity of other operations: O(sqrt(N)). But this can be improved.
If you choose some number of memory accesses (M) for each lookup operation, other operations may be done in O(M*N1/M) time. The idea of such algorithm is presented in this answer to related question. Trie structure, described there, allows easily converting the position to the array index and back. Each non-empty element of this array contains id and each element of hash map maps this id back to the array index.
To make it possible to insert element to this data structure, each block of contiguous array elements should be interleaved with some empty space. When one of the blocks exhausts all available empty space, we should rebuild the smallest group of blocks, related to some element of the trie, that has more than 50% empty space. When total number of empty space is less than 50% or more than 75%, we should rebuild the whole structure.
This rebalancing scheme gives O(MN1/M) amortized complexity only for random and evenly distributed insertions/removals. Worst case complexity (for example, if we always insert at leftmost position) is much larger for M > 2. To guarantee O(MN1/M) worst case we need to reserve more memory and to change rebalancing scheme so that it maintains invariant like this: keep empty space reserved for whole structure at least 50%, keep empty space reserved for all data related to the top trie nodes at least 75%, for next level trie nodes - 87.5%, etc.
With M=2, we have O(1) time for lookup and O(sqrt(N)) time for other operations.
With M=log(N), we have O(log(N)) time for every operation.
But in practice small values of M (like 2 .. 5) are preferable. This may be treated as O(1) lookup time and allows this structure (while performing typical insert/remove operation) to work with up to 5 relatively small contiguous blocks of memory in a cache-friendly way with good vectorization possibilities. Also this limits memory requirements if we require good worst case complexity.

You can achieve everything in O(sqrt(n)) time, but I'll warn you that it's going to take some work.
Start by having a look at a blog post I wrote on ThriftyList. ThriftyList is my implementation of the data structure described in Resizable Arrays in Optimal Time and Space along with some customizations to maintain O(sqrt(n)) circular sublists, each of size O(sqrt(n)). With circular sublists, one can achieve O(sqrt(n)) time insertion/removal by the standard insert/remove-then-shift in the containing sublist followed by a series of push/pop operations across the circular sublists themselves.
Now, to get the index at which a query value falls, you'll need to maintain a map from value to sublist/absolute-index. That is to say, a given value maps to the sublist containing the value, plus the absolute index at which the value falls (the index at which the item would fall were the list non-circular). From these data, you can compute the relative index of the value by taking the offset from the head of the circular sublist and summing with the number of elements which fall behind the containing sublist. To maintain this map requires O(sqrt(n)) operations per insert/delete.

Sounds roughly like Clojure's persistent vectors - they provide O(log32 n) cost for lookup and update. For smallish values of n O(log32 n) is as good as constant....
Basically they are array mapped tries.
Not quite sure on the time complexity for remove and insert - but I'm pretty sure that you could get a variant of this data structure with O(log n) removes and inserts as well.
See this presentation/video: http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
Source code (Java): https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/PersistentVector.java

Related

Data structure that supports random access by index and key, insertion, deletion in logaritmic time with order maintained

I'm looking for the data structure that stores an ordered list of E = (K, V) elements and supports the following operations in at most O(log(N)) time where N is the number of elements. Memory usage is not a problem.
E get(index) // get element by index
int find(K) // find the index of the element whose K matches
delete(index) // delete element at index, the following elements have their indexes decreased by 1
insert(index, E) // insert element at index, the following elements have their indexes increased by 1
I have considered the following incorrect solutions:
Use array: find, delete, and insert will still O(N)
Use array + map of K to index: delete and insert will still cost O(N) for shifting elements and updating map
Use linked list + map of K to element address: get and find will still cost O(N)
In my imagination, the last solution is the closest, but instead of linked list, a self-balancing tree where each node stores the number of elements on the left of it will make it possible for us to do get in O(log(N)).
However I'm not sure if I'm correct, so I want to ask whether my imagination is correct and whether there is a name for this kind of data structure so I can look for off-the-shelf solution.
The closest data structure i could think of is treaps.
Implicit treap is a simple modification of the regular treap which is a very powerful data structure. In fact, implicit treap can be considered as an array with the following procedures implemented (all in O(logN)O(log⁡N) in the online mode):
Inserting an element in the array in any location
Removal of an arbitrary element
Finding sum, minimum / maximum element etc. on an arbitrary interval
Addition, painting on an arbitrary interval
Reversing elements on an arbitrary interval
Using modification with implicit keys allows you to do all operation except the second one (find the index of the element whose K matches). I'll edit this answer if i come up with a better idea :)

Data structure for inverting a subarray in log(n)

Build a Data structure that has functions:
set(arr,n) - initialize the structure with array arr of length n. Time O(n)
fetch(i) - fetch arr[i]. Time O(log(n))
invert(k,j) - (when 0 <= k <= j <= n) inverts the sub-array [k,j]. meaning [4,7,2,8,5,4] with invert(2,5) becomes [4,7,4,5,8,2]. Time O(log(n))
How about saving the indices in binary search tree and using a flag saying the index is inverted? But if I do more than 1 invert, it mess it up.
Here is how we can approach designing such a data structure.
Indeed, using a balanced binary search tree is a good idea to start.
First, let us store array elements as pairs (index, value).
Naturally, the elements are sorted by index, so that the in-order traversal of a tree will yield the array in its original order.
Now, if we maintain a balanced binary search tree, and store the size of the subtree in each node, we can already do fetch in O(log n).
Next, let us only pretend we store the index.
Instead, we still arrange elements as we did with (index, value) pairs, but store only the value.
The index is now stored implicitly and can be calculated as follows.
Start from the root and go down to the target node.
Whenever we move to a left subtree, the index does not change.
When moving to a right subtree, add the size of the left subtree plus one (the size of the current vertex) to the index.
What we got at this point is a fixed-length array stored in a balanced binary search tree. It takes O(log n) to access (read or write) any element, as opposed to O(1) for a plain fixed-length array, so it is about time to get some benefit for all the trouble.
The next step is to devise a way to split our array into left and right parts in O(log n) given the required size of the left part, and merge two arrays by concatenation.
This step introduces dependency on our choice of the balanced binary search tree.
Treap is the obvious candidate since it is built on top of the split and merge primitives, so this improvement comes for free.
Perhaps it is also possible to split a Red-black tree or a Splay tree in O(log n) (though I admit I didn't try to figure out the details myself).
Right now, the structure is already more powerful than an array: it allows splitting and concatenation of "arrays" in O(log n), although element access is as slow as O(log n) too.
Note that this would not be possible if we still stored index explicitly at this point, since indices would be broken in the right part of a split or merge operation.
Finally, it is time to introduce the invert operation.
Let us store a flag in each node to signal whether the whole subtree of this node has to be inverted.
This flag will be lazily propagating: whenever we access a node, before doing anything, check if the flag is true.
If this is the case, swap the left and right subtrees, toggle (true <-> false) the flag in the root nodes of both subtrees, and set the flag in the current node to false.
Now, when we want to invert a subarray:
split the array into three parts (before the subarray, the subarray itself, and after the subarray) by two split operations,
toggle (true <-> false) the flag in the root of the middle (subarray) part,
then merge the three parts back in their original order by two merge operations.

Deleting a node from the middle of a heap

Deleting a node from the middle of the heap can be done in O(lg n) provided we can find the element in the heap in constant time. Suppose the node of a heap contains id as its field. Now if we provide the id, how can we delete the node in O(lg n) time ?
One solution can be that we can have a address of a location in each node, where we maintain the index of the node in the heap. This array would be ordered by node ids. This requires additional array to be maintained though. Is there any other good method to achieve the same.
PS: I came across this problem while implementing Djikstra's Shortest Path algorithm.
The index (id, node) can be maintained separately in a hashtable which has O(1) lookup complexity (on average). The overall complexity then remains O(log n).
Each data structure is designed with certain operations in mind. From wikipedia about heap operations
The operations commonly performed with a heap are:
create-heap: create an empty heap
find-max or find-min: find the maximum item of a max-heap or a minimum item of a min-heap, respectively
delete-max or delete-min: removing the root node of a max- or min-heap, respectively
increase-key or decrease-key: updating a key within a max- or min-heap, respectively
insert: adding a new key to the heap
merge joining two heaps to form a valid new heap containing all the elements of both.
This means, heap is not the best data structure for the operation you are looking for. I would advice you to look for a better suited data structure(depending on your requirements)..
I've had a similar problem and here's what I've come up with:
Solution 1: if your calls to delete some random item will have a pointer to item, you can store your individual data items outside of the heap; have the heap be of pointers to these items; and have each item contain its current heap array index.
Example: the heap contains pointers to items with keys [2 10 5 11 12 6]. The item holding value 10 has a field called ArrayIndex = 1 (counting from 0). So if I have a pointer to item 10 and want to delete it, I just look at its ArrayIndex and use that in the heap for a normal delete. O(1) to find heap location, then usual O(log n) to delete it via recursive heapify.
Solution 2: If you only have the key field of the item you want to delete, not its address, try this. Switch to a red-black tree, putting your payload data in the actual tree nodes. This is also O( log n ) for insert and delete. It can additionally find an item with a given key in O( log n ), which makes delete-by-key continue to be log n.
Between these, solution 1 will require an overhead of constantly updating ArrayIndex fields with every swap. It also results in a kind of strange one-off data structure that the next code maintainer would need to study and understand. I think solution 2 would be about as fast, and has the advantage that it's a well-understood algo.

How to remove elements from a binary heap?

As I understand, binary heap does not support removing random elements. What if I need to remove random elements from a binary heap?
Obviously, I can remove an element and re-arrange the entire heap in O(N). Can I do better?
Yes and no.
The problem is a binary heap does not support search for an arbitrary element. Finding it is itself O(n).
However, if you have a pointer to the element (and not only its value) - you can swap the element with the rightest leaf, remove this leaf, and than re-heapify the relevant sub-heap (by sifting down the newly placed element as much as needed). This results in O(logn) removal, but requires a pointer to the actual element you are looking for.
Amit is right in his answer but here is one more nuance:
the position of the removed item (where you put the right-most leaf) can be required to be bubbled up (compare with parent and move up until parent is larger than you).
Sometimes it is required to bubble down (compare with children and move down until all children are smaller than you). It all depends on the case.
Depends on what is meant by "random element." If it means that the heap contains elements [e1, e2, ..., eN] and one wants to delete some ei (1 <= i <= N), then this is possible.
If you are using a binary heap implementation from some library, it might be that it doesn't provide you with the API that you need. In that case, you should look for another library that has it.
If you were to implement it yourself, you would need two additional calls:
A procedure deleteAtIndex(heap, i) that deletes the node at index i
by positioning the last element in the heap array at i, decrementing the
element count, and finally shuffling down/up
the new ith element to maintain the heap invariant. The most
common use of this procedure is to "pop" the heap by calling
deleteAtIndex(heap, 1) -- assuming 1-origin indexing. This operation
will run O(log n) (though, to be complete, I'll note that the highest bound can be
improved up to O(log(log n)) depending on some assumptions about your elements' keys).
A procedure deleteElement(heap, e) that deletes the element e (your arbitrary element).
Your heap algorithm would maintain an array ElementIndex such that ElementIndex[e] returns
the current index of element e: calling deleteAtIndex(heap, ElementIndex[e])
will then do what you want. It will also run in O(log n) because the array access is constant.
Since binary heaps are often used in algorithms that merely pop the highest (or lowest)
priority element (rather than deleting arbitrary elements), I imagine that some libraries might miss on the deleteAtIndex API to save space (the extra ElementIndex array mentioned above).

What is the fastest way of updating an ordered array of numbers?

I need to calculate a 1d histogram that must be dynamically maintained and looked up frequently. One idea I had involves keeping an ordered array with the data (cause thus I can determine percentiles in O(1), and this suffices for quickly finding a histogram with non-uniform bins with the exactly same amount of points inside each bin).
So, is there a way that is less than O(N) to insert a number into an ordered array while keeping it ordered?
I guess the answer is very well known but I don't know a lot about algorithms (physicists doing numerical calculations rarely do).
In the general case, you could use a more flexible tree-like data structure. This would allow access, insertion and deletion in O(log) time and is also relatively easy to get ready-made from a library (ex.: C++'s STL map).
(Or a hash map...)
An ordered array with binary search does the same things as a tree, but is more rigid. It might probably be faster for acess and memory use but you will pay when having to insert or delete things in the middle (O(n) cost).
Note, however, that an ordered array might be enough for you: if your data points are often the same, you can mantain a list of pairs {key, count}, ordered by key, being able to quickly add another instance of an existing item (but still having to do more work to add a new item)
You could use binary search. This is O(log(n)).
If you like to insert number x, then take the number in the middle of your array and compare it to x. if x is smaller then then take the number in the middle of the first half else the number in the middle of the second half and so on.
You can perform insertions in O(1) time if you rearrange your array as a bunch of linked-lists hanging off of each element:
keys = Array([0][1][2][3][4]......)
a c b e f . .
d g i . . .
h j .
|__|__|__|__|__|__|__/linked lists
There's also the strategy of keeping two datastructures at the same time, if your update workload supports it without increasing time-complexity of common operations.
So, is there a way that is less than O(N) to insert a number into an
ordered array while keeping it ordered?
Yes, you can use an array to implement a binary search tree using arrays and do the insertion in O(log n) time. How?
Keep index 0 empty; index 1 = root; if node is the left child of parent node, index of node = 2 * index of parent node; if node is the right child of parent node, index of node = 2 * index of parent node + 1.
Insertion will thus be O(log n). Unfortunately, you might notice that the binary search tree for an ordered list might degenerate to a linear search if you don't balance the tree i.e. O(n), which is pointless. Here, you may have to implement a red black tree to keep the height balanced. However, this is quite complicated, BUT insertion can be done with arrays in O(log n). Note that the array elements will no longer be ints; instead, they'll have to be objects with a colour attribute.
I wouldn't recommend it.
Any particular reason this demands an array? You need an data structure which keeps data ordered and allows you to insert quickly. Why not a binary search tree? Or better still, a red black tree. In C++, you could use the Set structure in the Standard template library which is implemented as a red black tree. Gives you O(log(n)) insertion time and the ability to iterate over it like an array.

Resources