Indexing count of buckets - algorithm

So, here is my little problem.
Let's say I have a list of buckets a0 ... an which respectively contain L <= c0 ... cn < H items. I can decide on the L and H limits. I could even update them dynamically, though I don't think it would help much.
The order of the buckets matters. I can't go and swap them around.
Now, I'd like to index these buckets so that:
I know the total count of items
I can look-up the ith element
I can add/remove items from any bucket and update the index efficiently
Seems easy, right? Seeing these criteria I immediately thought of a Fenwick Tree. That's what they are meant for, really.
However, when you think about the use cases, a few other requirements creep in:
if a bucket count drops below L, the bucket must disappear (don't worry about the items yet)
if a bucket count reaches H, then a new bucket must be created because this one is full
I haven't figured out how to edit a Fenwick Tree efficiently: how do you remove or add a node without rebuilding the whole tree?
Of course we could set L = 0, so that removing buckets would become unnecessary; adding buckets, however, cannot really be avoided.
So here is the question:
Do you know either a better structure for this index, or how to update a Fenwick Tree?
The primary concern is efficiency, and because I do plan to implement it, cache/memory considerations are worth worrying about.
Background:
I am trying to come up with a structure somewhat similar to B-Trees and Ranked Skip Lists, but with a localized index. The problem with those two structures is that the index is kept alongside the data, which is inefficient in terms of cache (i.e. you need to fetch multiple pages from memory). Database implementations suggest that keeping the index isolated from the actual data is more cache-friendly, and thus more efficient.

I have understood your problem as:
Each bucket has an internal order and buckets themselves have an order, so all the elements have some ordering and you need the ith element in that ordering.
To solve that:
What you can do is maintain a 'cumulative value' tree where the leaf nodes (x1, x2, ..., xn) are the bucket sizes. The value of a node is the sum of the values of its immediate children. Keeping n a power of 2 will make it simple (you can always pad with zero-size buckets at the end) and the tree will be a complete tree.
Corresponding to each bucket you will maintain a pointer to the corresponding leaf node.
For example, say the bucket sizes are 2, 1, 4, 8.
The tree will look like
    15
   /  \
  3    12
 / \   / \
2   1 4   8
If you want the total count, read the value of the root node.
If you want to modify some xk (i.e. change the corresponding bucket size), you can walk up the tree from that leaf, following parent pointers and updating the values.
For instance, if you add 4 items to the second bucket, the tree becomes (the nodes marked with * are the ones that changed):
    19*
   /  \
  7*   12
 / \   / \
2   5* 4  8
If you want to find the ith element, you walk down the tree, effectively doing a binary search. At each node you already have the left and right child counts. If i is greater than the left child's value, you subtract the left child's value from i and recurse into the right subtree. If i is less than or equal to the left child's value, you go left and recurse again.
Say you wanted to find the 9th element in the above tree:
Since the left child of the root is 7 and 7 < 9,
you subtract 7 from 9 (to get 2) and go right.
Since 2 <= 4 (the left child of 12), you go left.
You are now at the leaf node corresponding to the third bucket. You need to pick the second element in that bucket.
If you have to add a new bucket, you double the size of your tree (if needed) by adding a new root: the existing tree becomes the left child, and the right child is a new subtree of all zero-size buckets except the one you just added (which will be the leftmost leaf of that new subtree). This is amortized O(1) time per added bucket. The caveat is that you can only add a bucket at the end, not anywhere in the middle.
Getting the total count is O(1).
Updating a single bucket and looking up an item are O(log n).
Adding new bucket is amortized O(1).
Space usage is O(n).
Instead of a binary tree, you can probably do the same with a B-Tree.
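For concreteness, here is a minimal Python sketch of such a cumulative-value tree stored in a flat array (leaves at indices n..2n-1, root at index 1); the class and method names are mine, and the leaf count is padded up to a power of two with zero-size buckets:

class CumulativeTree:
    def __init__(self, bucket_sizes):
        n = 1
        while n < len(bucket_sizes):
            n *= 2
        self.n = n
        self.tree = [0] * (2 * n)
        self.tree[n:n + len(bucket_sizes)] = bucket_sizes
        for i in range(n - 1, 0, -1):       # build internal sums bottom-up
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):                        # total item count: O(1)
        return self.tree[1]

    def update(self, bucket, new_size):     # change one bucket size: O(log n)
        i = self.n + bucket
        delta = new_size - self.tree[i]
        while i >= 1:                       # walk up to the root
            self.tree[i] += delta
            i //= 2

    def find(self, i):                      # locate the i-th item (1-based): O(log n)
        node = 1
        while node < self.n:                # descend until a leaf is reached
            left = self.tree[2 * node]
            if i > left:
                i -= left                   # skip the items in the left subtree
                node = 2 * node + 1
            else:
                node = 2 * node
        return node - self.n, i             # (bucket index, rank inside that bucket)

Reproducing the worked example: with CumulativeTree([2, 1, 4, 8]), total() is 15; after update(1, 5) the total is 19 and find(9) returns (2, 2), i.e. the second element of the third bucket.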

I still hope for answers; however, here is what I could come up with so far, following @Moron's suggestion.
Apparently my little Fenwick Tree idea cannot be easily adapted. It's easy to append new buckets at the end of the Fenwick Tree, but not in the middle, so it's kind of a lost cause.
We're left with 2 data structures: Binary Indexed Trees (ironically, the very name Fenwick used to describe his structure) and Ranked Skip Lists.
Typically, these do not separate the data from the index; however, we can get this behavior by:
Use indirection: the element held by the node is a pointer to a bucket, not the bucket itself
Use pool allocation so that the index elements, even though allocated independently from one another, are still close in memory, which should help the cache
I tend to prefer Skip Lists to Binary Trees because they are self-organizing, so I'm spared the trouble of constantly re-balancing my tree.
These structures would allow getting to the ith element in O(log N); I don't know whether faster asymptotic performance is possible.
Another interesting implementation detail: I have a pointer to an element, but others might have been inserted/removed since, so how do I know the rank of my element now?
It's possible if the bucket points back to the node that owns it. But this means that either the node should not move, or it should update the bucket's pointer whenever it is moved around.

Related

Binary tree without pointers

Below is a representation of a binary tree that I use in my project. At the bottom are the leaf nodes (orange boxes), and every level is the sum of the children below.
So, the 3 on the leftmost node is the sum of 1 and 2 (its left and right children), and 10 is the sum of 3 and 7 (again, its left and right children).
What I am trying to do is store this tree in a flat array without using any pointers. So this array is basically an integer array holding 2n-1 nodes (n is the number of leaf nodes).
So the index of the root element is 0 (let's call it p), the index of its left child is 2p+1, and the index of the right child is 2p+2. Please see Binary Tree (Array implementation).
Everything works nicely if I know the number of leaf values beforehand, but I can't seem to find a way to store this tree in a dynamically expanding array.
If I need to add 9, for example, as the 9th element of the array, the structure needs to change and I need to recalculate all the indices again, which I'd rather avoid because there may be hundreds of thousands of elements in the array at any time.
Does anyone know of an implementation that handles dynamic arrays with this implementation?
EDIT:
Below is a demonstration of what happens when I add new elements to the array: 36 was the root before; now it's a second-level element and the new root array[0] is 114, which triggers a new layout.
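For reference, a minimal sketch of the implicit-array index arithmetic described above (the helper names are mine; this only spells out the fixed layout and does not solve the dynamic-growth problem):

def left(p):        # index of the left child of node p
    return 2 * p + 1

def right(p):       # index of the right child of node p
    return 2 * p + 2

def parent(p):      # index of the parent of node p (p > 0)
    return (p - 1) // 2

# Example: the sums over the leaves 1, 2, 3, 7 stored breadth-first:
#   tree = [13, 3, 10, 1, 2, 3, 7]
# tree[0] == tree[left(0)] + tree[right(0)], and so on for every internal node.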

Data structure which is efficient to add new number and count how many stored numbers smaller than some query number

I want to find an efficient data structure that can handle the following use case.
I can add new elements to this data structure, e.g.
I call the add() API: add([2,3,4,5,3]), and then the data structure stores [2,3,3,4,5]. I can query some target and it returns how many stored numbers are smaller than that target, e.g. query(4) returns 3 (since there is one 2 and two 3s). The frequencies of calling add and query are of the same order.
At first I thought of a segment tree; however, the input numbers can be any int value, so the space would be O(2^32).
Could you give me some advice about which data structure I should use?
You can do this using an order statistic tree, which is a kind of binary search tree where each node also stores the cardinality of its own subtree. Inserting into an order statistic tree still takes O(log n) time, because it's a binary search tree, although the insert operation is a little more complicated because it has to keep the cardinalities of each node up-to-date.
Computing the number of members less than a given target also takes O(log n) time; start at the root node:
If the target is less than or equal to the root node's value, then recurse on the left subtree.
Otherwise, return the left subtree's cardinality, plus one for the current node itself, plus the result of recursing on the right subtree.
The base case is that you always return 0 for an empty subtree.
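A minimal, unbalanced Python sketch of that idea (a real implementation would use a self-balancing tree such as a red-black tree; the names are mine):

class Node:
    def __init__(self, value):
        self.value = value
        self.size = 1          # cardinality of the subtree rooted here
        self.left = None
        self.right = None

def insert(node, value):
    if node is None:
        return Node(value)
    node.size += 1             # keep cardinalities up to date on the way down
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node

def count_less_than(node, target):
    if node is None:
        return 0               # empty subtree contributes nothing
    if target <= node.value:
        return count_less_than(node.left, target)
    left_size = node.left.size if node.left else 0
    return left_size + 1 + count_less_than(node.right, target)

root = None
for x in [2, 3, 4, 5, 3]:
    root = insert(root, x)
print(count_less_than(root, 4))   # 3 (one 2 and two 3s)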

Segment tree with easy shifting

I need a data structure based on a segment tree, but with one difference from the classical segment tree: the DS should support easy shifting of elements. I mean I would like a data structure on which I could:
make queries on segments (i.e. sum of elements from index l to index r)
insert new elements before any index and then shift all elements to the right of the new elements
It would be nice if all these operations worked in O(log n).
Greetings
Yes, but not sure you can keep the tree balanced.
The basic structure is like this. Every node keeps track only of the distance between the start of the interval it covers and the split between its children. For example, if a node covers the interval [A, B] and has children covering [A, C] and [C+1, B], then the node should only store the value C - A. This allows you to change the sizes of the intervals without having to touch the entire structure. It also means that you invalidate any existing iterators when you shift anything inside, and that each iterator has to keep track of its interval.
To do a shift operation:
do a search for the insertion point.
pick an appropriate node on the path.
insert a new node above the selected node. This node should cover the old + new interval, so set its split value to the size of the shift. Now the new space is the left child and the old space is the right child.
add any children you want to keep for the new space.
update all the parents where the split point was on the left since now there are more values before their split.
Any other operation should be done the same way. You should pick a node whose interval is roughly equal in size to the new interval so that you keep the O(log n) cost per operation. Obviously, inserting 1 element over and over again can cause some paths to be considerably longer than others, unless you also add a step to rebalance the tree after a shift.
However, if you know the shifts beforehand, I would simply go through the shifts backwards and compute the final location of all the data and queries in O(N). Then you can simply build a regular segment tree and not worry about the shifts.

return inserted items for a given interval

How would one design a memory-efficient system which accepts items added into it and allows items to be retrieved for a given time interval (i.e. return items inserted between time T1 and time T2)? There is no DB involved; items are stored in memory. What is the data structure involved and the associated algorithm?
Updated:
Assume an extremely high insertion rate compared to the query rate.
You can use a sorted data structure, where the key is the time of arrival. Note the following:
items are not removed
items are inserted in order [if item i was inserted after item j then key(i) > key(j)].
For this reason, a tree is discouraged, since it is overkill: insertion into it is O(log n), whereas you can get O(1) insertion. I suggest using one of the following:
(1) Array: the array is always filled at its end. When the allocated array is full, reallocate a bigger [double-sized] array and copy the existing array into it.
Advantages: good caching is usually expected with arrays, O(1) amortized insertion, used space is at most 2*elementSize*#elements.
Disadvantages: high latency: when the array is full, it will take O(n) to add an element, so you need to expect that once in a while there will be a costly operation.
(2) Skip list: the skip list also gives you O(log n) seek and O(1) insertion at the end, and it doesn't have the latency issue. However, it will suffer more from cache misses than an array. Space used is on average elementSize*#elements + 2*pointerSize*#elements for a skip list.
Advantages: O(1) insertion, no costly ops.
Disadvantages: bad caching is expected.
Suggestion:
I suggest using an array if latency is not an issue. If it is, you'd better use a skip list.
In both, finding the desired interval is:
findInterval(T1, T2):
    start <- data.find(T1)
    end <- data.find(T2)
    for each element in data from start to end:
        yield element
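For the array variant, here is a minimal Python sketch using bisect, assuming items arrive with non-decreasing timestamps so that appending keeps the array sorted (the names are mine):

import bisect

class TimeLog:
    def __init__(self):
        self.times = []        # arrival timestamps, sorted by construction
        self.items = []        # payloads, parallel to self.times

    def add(self, t, item):    # O(1) amortized append
        self.times.append(t)
        self.items.append(item)

    def query(self, t1, t2):   # items inserted in [t1, t2]: O(log n + k)
        start = bisect.bisect_left(self.times, t1)
        end = bisect.bisect_right(self.times, t2)
        return self.items[start:end]

log = TimeLog()
for t, item in [(1, "a"), (3, "b"), (7, "c"), (9, "d")]:
    log.add(t, item)
print(log.query(2, 8))   # ['b', 'c']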
Either a BTree or a Binary Search Tree could be a good in-memory data structure to accomplish the above. Just save the timestamp in each node and you can do a range query.
You can add them all to a simple array and sort them.
Do a binary search to locate both T1 and T2. All the array elements between them are what you are looking for.
This is helpful if the searching is done only after all the elements are added. If not, you can use an AVL or Red-Black tree.
How about a relational interval tree (encode your items as intervals containing only a single element, e.g. [a,a])? It has been said already that the ratio of the anticipated operations matters (a lot, actually), but here's my two cents:
I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now which has a timestamp from a week ago or something. If that's the case, go for the simple array and do interpolation search or something similar (your items will already be sorted according to the attribute your query refers to, i.e. the time t(X)).
We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer). The objective is to keep the tree as full as possible to the left - I'll make clearer what that means below. Make sure each node has a pointer to its (up to) two children and to its parent, and knows the depth of the subtree rooted at that node.
Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:
If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the tree depth field of at most O(log(n)) nodes.)
If 1, then add the new node as its right child.
If 2, then things get interesting. Go up the tree from N until either you find a node that has only 1 child, or the root. If you find a node with only 1 child (this is necessarily the left child), then add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full. Add the new node as the new root node and the old root node as its left child. Don't change the old tree structure otherwise.
Addendum: in order to make cache behaviour and memory overhead better, the best solution is probably to make a tree or skip list of arrays. Instead of every node holding a single timestamp and a single value, make every node hold an array of, say, 1024 timestamps and values. When an array fills up you add a new one to the top-level data structure, but in most steps you just add a single element to the end of the "current" array. This doesn't affect big-O behaviour with respect to either memory or time, but it reduces the overhead by a factor of 1024, while latency stays very small.
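A minimal sketch of that chunking idea, with a plain list of chunks standing in for the tree/skip list and an arbitrary chunk size (the names are mine):

import bisect

CHUNK = 1024

class ChunkedTimeLog:
    def __init__(self):
        self.chunk_starts = []   # first timestamp of each chunk
        self.chunks = []         # each chunk is a list of (timestamp, item) pairs

    def add(self, t, item):
        if not self.chunks or len(self.chunks[-1]) == CHUNK:
            self.chunks.append([])           # start a fresh chunk
            self.chunk_starts.append(t)
        self.chunks[-1].append((t, item))

    def query(self, t1, t2):
        # Find the first chunk that could contain t1, then scan forward.
        i = max(bisect.bisect_right(self.chunk_starts, t1) - 1, 0)
        out = []
        while i < len(self.chunks) and self.chunk_starts[i] <= t2:
            out.extend(item for t, item in self.chunks[i] if t1 <= t <= t2)
            i += 1
        return out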

Are duplicate keys allowed in the definition of binary search trees?

I'm trying to find the definition of a binary search tree and I keep finding different definitions everywhere.
Some say that for any given subtree the left child key is less than or equal to the root.
Some say that for any given subtree the right child key is greater than or equal to the root.
And my old college data structures book says "every element has a key and no two elements have the same key."
Is there a universal definition of a BST? Particularly in regard to what to do with trees with multiple instances of the same key.
EDIT: Maybe I was unclear, the definitions I'm seeing are
1) left <= root < right
2) left < root <= right
3) left < root < right, such that no duplicate keys exist.
Many algorithms will specify that duplicates are excluded. For example, the example algorithms in the MIT Algorithms book usually present examples without duplicates. It is fairly trivial to implement duplicates (either as a list at the node, or in one particular direction.)
Most (that I've seen) specify left children as <= and right children as >. Practically speaking, a BST which allows either of the right or left children to be equal to the root node, will require extra computational steps to finish a search where duplicate nodes are allowed.
It is best to use a list at the node to store duplicates, because inserting an '=' value to one side of a node requires rewriting the tree on that side to place the node as the child, or else the node gets placed as a grand-child at some point below, which eliminates some of the search efficiency.
You have to remember, most of the classroom examples are simplified to portray and deliver the concept. They aren't worth squat in many real-world situations. But the statement, "every element has a key and no two elements have the same key", is not violated by the use of a list at the element node.
So go with what your data structures book said!
Edit:
A universal definition of a Binary Search Tree involves storing and searching for a key based on traversing a data structure in one of two directions. In the pragmatic sense, that means if the value is less than or greater than the node's, you traverse the data structure in one of two 'directions'. So, in that sense, duplicate values don't make any sense at all.
This is different from BSP, or binary space partitioning, but not all that different. The search algorithm has one of two directions for 'travel', or it is done (successfully or not). So I apologize that my original answer didn't address the concept of a 'universal definition', as duplicates are really a distinct topic (something you deal with after a successful search, not as part of the binary search itself).
If your binary search tree is a red-black tree, or you intend to do any kind of "tree rotation" operations, duplicate nodes will cause problems. Imagine your tree rule is this:
left < root <= right
Now imagine a simple tree whose root is 5, left child is nil, and right child is 5. If you do a left rotation on the root you end up with a 5 in the left child and a 5 in the root with the right child being nil. Now something in the left tree is equal to the root, but your rule above assumed left < root.
I spent hours trying to figure out why my red/black trees would occasionally traverse out of order, the problem was what I described above. Hopefully somebody reads this and saves themselves hours of debugging in the future!
All three definitions are acceptable and correct. They define different variations of a BST.
Your college data structures book failed to clarify that its definition was not the only possible one.
Certainly, allowing duplicates adds complexity. If you use the definition "left <= root < right" and you have a tree like:
  3
 / \
2   4
then adding a "3" duplicate key to this tree will result in:
  3
 / \
2   4
 \
  3
Note that the duplicates are not in contiguous levels.
This is a big issue when allowing duplicates in a BST representation such as the one above: duplicates may be separated by any number of levels, so checking for a duplicate's existence is not as simple as just checking the immediate children of a node.
An option to avoid this issue is to not represent duplicates structurally (as separate nodes) but instead use a counter that counts the number of occurrences of the key. The previous example would then have a tree like:
  3(1)
  /  \
2(1)  4(1)
and after insertion of the duplicate "3" key it will become:
  3(2)
  /  \
2(1)  4(1)
This simplifies lookup, removal and insertion operations, at the expense of some extra bytes and counter operations.
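A minimal Python sketch of the counter-per-node approach (insert-only and unbalanced, to keep it short; the names are mine):

class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1          # number of occurrences of this key
        self.left = None
        self.right = None

def insert(node, key):
    if node is None:
        return Node(key)
    if key == node.key:
        node.count += 1         # duplicate: bump the counter, no new node
    elif key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node

def count_of(node, key):
    while node is not None:
        if key == node.key:
            return node.count
        node = node.left if key < node.key else node.right
    return 0

root = None
for k in [3, 2, 4, 3]:          # inserting 3, 2, 4 and a duplicate 3
    root = insert(root, k)
print(count_of(root, 3))        # 2, i.e. the 3(2) node from the figure above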
In a BST, all values descending on the left side of a node are less than (or equal to, see later) the node itself. Similarly, all values descending on the right side of a node are greater than (or equal to) that node value(a).
Some BSTs may choose to allow duplicate values, hence the "or equal to" qualifiers above. The following example may clarify:
        14
       /  \
     13    22
     /    /  \
    1   16    29
             /  \
           28    29
This shows a BST that allows duplicates(b) - you can see that to find a value, you start at the root node and go down the left or right subtree depending on whether your search value is less than or greater than the node value.
This can be done recursively with something like:
def hasVal(node, srchval):
    if node is None:          # empty subtree: value not found
        return False
    if node.val == srchval:
        return True
    if node.val > srchval:
        return hasVal(node.left, srchval)
    return hasVal(node.right, srchval)
and calling it with:
foundIt = hasVal (rootNode, valToLookFor)
Duplicates add a little complexity since you may need to keep searching once you've found your value, for other nodes of the same value. Obviously that doesn't matter for hasVal since it doesn't matter how many there are, just whether at least one exists. It will however matter for things like countVal, since it needs to know how many there are.
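For illustration, here is a hypothetical countVal written in the same style as hasVal above; under the <=/>= rule, duplicates of a matching key may sit in either subtree, so it descends into both children when it finds a match:

def countVal(node, srchval):
    if node is None:
        return 0
    if node.val == srchval:
        # Count this node, then keep looking on both sides for more duplicates.
        return 1 + countVal(node.left, srchval) + countVal(node.right, srchval)
    if node.val > srchval:
        return countVal(node.left, srchval)
    return countVal(node.right, srchval)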
(a) You could actually sort them in the opposite direction should you so wish provided you adjust how you search for a specific key. A BST need only maintain some sorted order, whether that's ascending or descending (or even some weird multi-layer-sort method like all odd numbers ascending, then all even numbers descending) is not relevant.
(b) Interestingly, if your sorting key uses the entire value stored at a node (so that nodes containing the same key have no other extra information to distinguish them), there can be performance gains from adding a count to each node, rather than allowing duplicate nodes.
The main benefit is that adding or removing a duplicate will simply modify the count rather than inserting or deleting a new node (an action that may require re-balancing the tree).
So, to add an item, you first check if it already exists. If so, just increment the count and exit. If not, you need to insert a new node with a count of one then rebalance.
To remove an item, you find it then decrement the count - only if the resultant count is zero do you then remove the actual node from the tree and rebalance.
Searches are also quicker given there are fewer nodes but that may not be a large impact.
For example, the following two trees (non-counting on the left, and counting on the right) would be equivalent (in the counting tree, i.c means c copies of item i):
        __14__                    ___22.2___
       /      \                  /          \
     14        22              7.1          29.1
    /  \      /  \            /    \       /    \
   1    14  22    29        1.1   14.3   28.1   30.1
    \            /  \
     7         28    30
Removing the leaf-node 22 from the left tree would involve rebalancing (since it now has a height differential of two) the resulting 22-29-28-30 subtree such as below (this is one option, there are others that also satisfy the "height differential must be zero or one" rule):
   \                     \
    22                    29
      \                  /  \
       29      -->     28    30
      /  \            /
    28    30        22
Doing the same operation on the right tree is a simple modification of the root node from 22.2 to 22.1 (with no rebalancing required).
In the book "Introduction to algorithms", third edition, by Cormen, Leiserson, Rivest and Stein, a binary search tree (BST) is explicitly defined as allowing duplicates. This can be seen in figure 12.1 and the following (page 287):
"The keys in a binary search tree are always stored in such a way as to satisfy the binary-search-tree property: Let x be a node in a binary search tree. If y is a node in the left subtree of x, then y:key <= x:key. If y is a node in the right subtree of x, then y:key >= x:key."
In addition, a red-black tree is then defined on page 308 as:
"A red-black tree is a binary search tree with one extra bit of storage per node: its color"
Therefore, red-black trees defined in this book support duplicates.
Any definition is valid. As long as you are consistent in your implementation (always put equal nodes to the right, always put them to the left, or never allow them) then you're fine. I think it is most common to not allow them, but it is still a BST if they are allowed and place either left or right.
I just want to add some more information to what @Robert Paulson answered.
Let's assume that node contains key & data. So nodes with the same key might contain different data.
(So the search must find all nodes with the same key)
1. left <= cur < right
2. left < cur <= right
3. left <= cur <= right
4. left < cur < right && cur contains sibling nodes with the same key
5. left < cur < right, such that no duplicate keys exist
1 & 2 work fine if the tree does not have any rotation-related functions to prevent skewness.
But this form doesn't work with an AVL tree or a Red-Black tree, because rotation will break the principle.
And even if search() finds the node with the key, it must traverse down to the leaf node to find the nodes with the duplicate key,
making the time complexity for search Θ(log N).
3 will work well with any form of BST with rotation-related functions.
But the search will take O(n), ruining the purpose of using a BST.
Say we have the tree below, built with principle 3:
        12
       /  \
     10    20
    /  \   /
   9   11 12
          / \
        10   12
If we do search(12) on this tree, even though we find 12 at the root, we must keep searching both the left & right children for the duplicate keys.
This takes O(n) time, as mentioned.
4 is my personal favorite. Let's say sibling means a node with the same key.
We can change the above tree into the one below.
       12 - 12 - 12
      /    \
 10 - 10    20
  /  \
 9    11
Now any search will take O(log N) because we don't have to traverse children looking for duplicate keys.
And this principle also works well with AVL or RB trees.
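A minimal Python sketch of approach 4, where each tree node carries a list of 'sibling' payloads sharing the same key so the tree itself never stores duplicate keys (the names are mine, and no rebalancing is shown):

class Node:
    def __init__(self, key, data):
        self.key = key
        self.items = [data]     # all data items that share this key
        self.left = None
        self.right = None

def insert(node, key, data):
    if node is None:
        return Node(key, data)
    if key == node.key:
        node.items.append(data)         # duplicate key: chain onto this node
    elif key < node.key:
        node.left = insert(node.left, key, data)
    else:
        node.right = insert(node.right, key, data)
    return node

def find_all(node, key):
    while node is not None:
        if key == node.key:
            return node.items           # every item with this key, after an O(log n) search
        node = node.left if key < node.key else node.right
    return []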
Working on a red-black tree implementation I was getting problems validating the tree with multiple keys until I realized that with the red-black insert rotation, you have to loosen the constraint to
left <= root <= right
Since none of the documentation I was looking at allowed for duplicate keys and I didn't want to rewrite the rotation methods to account for it, I just decided to modify my nodes to allow for multiple values within the node, and no duplicate keys in the tree.
Those three things you said are all true.
Keys are unique
To the left are keys less than this one
To the right are keys greater than this one
I suppose you could reverse your tree and put the smaller keys on the right, but really the "left" and "right" concept is just that: a visual concept to help us think about a data structure which doesn't really have a left or right, so it doesn't really matter.
1.) left <= root < right
2.) left < root <= right
3.) left < root < right, such that no duplicate keys exist.
I might have to go and dig out my algorithm books, but off the top of my head (3) is the canonical form.
(1) or (2) only come about when you start to allow duplicate nodes and you put duplicate nodes in the tree itself (rather than the node containing a list).
Duplicate Keys
• What happens if there's more than one data item with the same key?
  – This presents a slight problem in red-black trees.
  – It's important that nodes with the same key are distributed on both sides of other nodes with the same key.
  – That is, if keys arrive in the order 50, 50, 50, you want the second 50 to go to the right of the first one, and the third 50 to go to the left of the first one.
  – Otherwise, the tree becomes unbalanced.
• This could be handled by some kind of randomizing process in the insertion algorithm.
  – However, the search process then becomes more complicated if all items with the same key must be found.
• It's simpler to outlaw items with the same key.
  – In this discussion we'll assume duplicates aren't allowed.
One can create a linked list for each node of the tree that contains duplicate keys and store data in the list.
The element ordering relation <= is a total order, so the relation must be reflexive, but commonly a binary search tree (aka BST) is a tree without duplicates.
Otherwise, if there are duplicates, you need to run the same deletion function two or more times!
