k-th largest element in B-tree [duplicate] - performance

I'm trying to understand how I should think about getting the k-th key/element in a B-tree. Even if it's steps instead of code, it will still help a lot. Thanks
Edit: To clear up, I'm asking for the k-th smallest key in the B-tree.

There's no efficient way to do it using a standard B-tree. Broadly speaking, I see 2 options:
Convert the B-tree to an order statistic tree to allow for this operation in O(log n).
That is, for each node, keep a variable representing the size (number of elements) of the subtree rooted at that node (that node, all its children, all its children's children, etc.).
Whenever you do an insertion or deletion, you update this variable appropriately. You will only need to update nodes already being visited, so it won't change the complexity of those operations.
Getting the k-th element would involve adding up the sizes of the children until we get to k, picking the appropriate child to visit and decreasing k appropriately. Pseudo-code:
select(root, k) // initial call for root
// returns the k'th element (0-based) of the elements in node
function select(node, k)
    for i = 0 to node.elementCount
        size = 0
        if node.child[i] != null
            size = node.sizeOfChild[i]
        if k < size // element is in the child subtree
            return select(node.child[i], k)
        else if k == size // element is here
                && i != node.elementCount // only equal when k == elements in tree, i.e. k is not valid
            return node.element[i]
        else // k > size, element is to the right
            k -= size + 1 // child[i] subtree + node.element[i]
    return null // k >= elements in tree
Consider child[i] to be directly to the left of element[i].
The pseudo-code for the binary search tree (not B-tree) provided on Wikipedia may explain the basic concept here better than the above.
Note that the size of a node's subtree should be stored in its parent (note that I didn't use node.child[i].size above). Storing it in the node itself would be much less efficient, as reading nodes is considered a non-trivial or expensive operation for B-tree use cases (nodes must often be read from disk), so you want to minimise the number of nodes read, even if that makes each node slightly bigger.
Do an in-order traversal until you've seen k elements - this will take O(n).
Pseudo-code:
select(root, *k) // initial call for root
// returns the k'th element of the elements in node
function select(node, *k) // pass k by pointer, allowing global update
    if node == null
        return null
    for i = 0 to node.elementCount
        element = select(node.child[i], k) // check if it's in the child's subtree
        if element != null // element was found
            return element
        if i != node.elementCount // exclude last iteration
            if *k == 0 // element is here
                return node.element[i]
            (*k)-- // only decrease k for node.element[i] (i.e. by 1),
                   // k is decreased for node.child[i] in the recursive call
    return null
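Both options can be sketched in Python. This is only an illustration under assumptions that are not part of the answer: the hypothetical BNode class keeps its keys in a list, stores the child subtree sizes in the parent as suggested above, and uses an empty children list for leaves. Ranks are 0-based.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Iterator, List, Optional


@dataclass
class BNode:
    # Hypothetical node layout: children[i] sits directly left of keys[i].
    keys: List[int]
    children: List["BNode"] = field(default_factory=list)  # empty for a leaf
    child_sizes: List[int] = field(default_factory=list)   # sizes kept in the parent


def select(node: BNode, k: int) -> Optional[int]:
    """Option 1: k-th smallest key (0-based) using stored subtree sizes."""
    for i in range(len(node.keys) + 1):
        size = node.child_sizes[i] if node.children else 0
        if k < size:                          # answer lies inside child i
            return select(node.children[i], k)
        if k == size and i < len(node.keys):  # answer is the key right of child i
            return node.keys[i]
        k -= size + 1                         # skip child i and keys[i]
    return None                               # k is out of range


def in_order(node: BNode) -> Iterator[int]:
    """Yield every key in ascending order."""
    for i, key in enumerate(node.keys):
        if node.children:
            yield from in_order(node.children[i])
        yield key
    if node.children:
        yield from in_order(node.children[-1])


def select_linear(root: BNode, k: int) -> Optional[int]:
    """Option 2: walk in order and stop once k + 1 keys were seen, O(n)."""
    return next(islice(in_order(root), k, None), None)


root = BNode(
    keys=[3, 6],
    children=[BNode([1, 2]), BNode([4, 5]), BNode([7, 8, 9])],
    child_sizes=[2, 2, 3],
)
print(select(root, 4), select_linear(root, 4))  # prints: 5 5
```

The first function only descends one root-to-leaf path, while the second may touch every key, which is the trade-off between the two options.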

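You can maintain a second balanced binary search tree (such as a splay tree, or simply std::set) alongside the B-tree to record which elements it currently holds. Every operation then finishes in O(log n), and it's quite easy to implement (when using std::set), but it doubles the space cost. Note that std::set itself offers no selection by rank, so for the k-th lookup the auxiliary tree has to be size-augmented like an order statistic tree.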

Ok so, after a few sleepless hours I managed to do it, and for anyone who will wonder how, here it goes in pseudocode (k=0 for first element):
get_k-th(current, k):
    for i = 0 to current.number_of_children_nodes
        int size = size_of_B-tree(current.child[i])
        if (k <= size-1)
            return get_k-th(current.child[i], k)
        else if (k == size && i < current.number_of_children_nodes)
            return current.key[i]
        else if (is_leaf_node(current) && k < current.number_of_children_nodes)
            return current.key[k]
        k = k - size - 1;
    return null
I know this might look kinda weird, but it's what I came up with and thankfully it works. There might be a way to make this code clearer, and probably more efficient, but I hope it's good enough to help anyone else who might get stuck on the same obstacle as I did.

Related

Avl tree- find first key missing from the tree after i

I have the above problem that I'm trying to solve. The AVL tree also stores the size of the subtree at every node, and I know the maximum key. I need to find the first number after i that isn't in the tree, and I need to do it in O(log n) time.
So far I have:
if i is greater than or equal to the maximum, return i+1.
For the other cases I tried finding the smallest key in the tree that is greater than i, which I know I can do in O(log n); if that key is bigger than i+1, return i+1.
Now I understand that if i+1 is in the tree I need to keep searching, but this way I end up with a time complexity higher than required.
I would greatly appreciate any guidance. I'm not looking for code, only an idea of how to solve it in the time specified.
I think your problem might be more in within the time complexity analysis than the actual algorithm.
We know that a well-formed AVL tree with n nodes has height O(log[2](n)), so a properly implemented search always takes O(log[2](n)) time. Searching for a missing item in this case is no different than searching for an existing item.
Let's say you have an AVL tree A and it includes i and i+1. Then we know that i+1 must either be the parent node of i and i being the left child node, or that i+1 is the right child node of i. So we can conclude:
if i ^ i+1 in A => i+1(l)=i v i(r)=i+1
So if you find i and its parent node is not i+1 its right child node has to be i+1. You can extend this to i=i+1 after finding i+1 and keep checking for this condition. The cool thing here is there is only one place you need to look at for every value i+n after i if you keep track of the nodes you have traversed through.
If you go through [i+7, i+4, i] You immediately know that if A contains i it cannot contain i+1. This is due to i+1 < i+4 but i < i+1 < i+4.
If you go through [i-6, i-2, i] You also immediately know that if A contains i+1 it cannot contain i+1. This is due to i-2 < i+1 but i-2 < i < i+1.
If you were to go through [i+7, i+3, i+1, i] you found i, i+1 and since i+3 is not i+2 you know i+2 has to be the right child node of i+1 since it must not be a child of i+3 since it is smaller, but i+1 already took the left child position. So you check if i+1's right child is i+2, you continue checking for i+4 from i+3 on, essentially using the algorithm:
define stack //use your favourite stack implementation
let n = root node
let i = yourval
while n.val != i
    stack.push(n)
    if i > n.val
        n = n.right
    else //equivalent to "else if i < n.val" since loop condition ensures they are not equal
        n = n.left
while !stack.empty
    if stack.peek.right.val != stack.peek.val + 1
        //Implies parent holds value
        temp = stack.pop.val + 1
        if(temp != stack.peek.val) //If the parent does not hold the next value return it
            return temp;
    else //Right child holds value
        stack.push(stack.peek.right)
        i = stack.peek.val
return i+1 //if the stack is empty eventually return the next value
Due to how AVL trees are formed, your stack will hold at most 2*log[2](n) elements (if i is a leaf on the LHS and the last value is a leaf on the RHS). So in total your search will take log[2](n) for the initial search for i and another 2*log[2](n) for the rest, which combined makes 3*log[2](n), which in Big Omicron is still O(log[2](n)).
As a hint, think about how you'd solve this problem if you had the elements in an array rather than an AVL tree. How would you solve this problem in time O(log n) in an array by using a modified binary search?
Once you've worked out how to do that, see if you can adjust your solution to work in a binary search tree rather than an array. The intuition will be pretty similar, except that instead of looking at the middle of the overall range of elements at each point, you'll look at the root of the current subtree, which, while close to the middle, isn't always exactly at the middle.
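One way to read the array half of this hint is sketched below in Python. It assumes a sorted list of distinct integers; the function name first_missing_after is made up for the illustration. The observation it relies on is that, in a gap-free run starting at i+1, the difference between a value and its index stays constant, so the end of the run can be found by binary search.

```python
from bisect import bisect_right


def first_missing_after(a, i):
    """Smallest integer greater than i that is absent from the sorted list a."""
    p = bisect_right(a, i)            # a[p:] holds the elements greater than i
    lo, hi = p, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        # The run i+1, i+2, ... is unbroken up to mid exactly when
        # a[mid] == (i + 1) + (mid - p).
        if a[mid] - mid == i + 1 - p:
            lo = mid + 1
        else:
            hi = mid
    return i + 1 + (lo - p)           # lo - p elements of the run are present


print(first_missing_after([1, 2, 3, 5, 6, 9], 1))  # prints: 4
```

In a tree, the index of a key corresponds to its rank, which the stored subtree sizes give you while descending, so the same comparison can be made at each node instead of at the middle of an array range.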

Find whether given sum exists over a path in a BST

The question is to find whether a given sum exists over any path in a BST. The question is damn easy if a path means root to leaf, or easy if the path means a portion of a path from root to leaf that may not include the root or the leaf. But it becomes difficult here, because a path may span both left and right child of a node. For example, in the given figure, a sum of 132 exists over the circled path. How can I find the existence of such a path? Using hash to store all possible sums under a node is frowned upon!
You can certainly generate all possible paths, summing incrementally as you go. The fact that the tree is a BST might let you save time by bounding out certain sums, though I'm not sure that will give an asymptotic speed increase. The problem is that a sum formed using the left child of a given node will not necessarily be less than a sum formed using the right child, since the path for the former sum could contain many more nodes. The following algorithm will work for all trees, not just BSTs.
To generate all possible paths, notice that the topmost point of a path is special: it's the only point in a path which is allowed (though not required) to have both children contained in the path. Every path contains a unique topmost point. Therefore the outer layer of recursion should be to visit every tree node, and to generate all paths that have that node as the topmost point.
// Report whether any path whose topmost node is t sums to target.
// Recurses to examine every node under t.
EnumerateTopmost(Tree t, int target) {
if (t == NULL) return // Nothing to examine below a leaf.
// Get a list of sums for paths containing the left child.
// Include a 0 at the start to account for a "zero-length path" that
// does not contain any children. This will be in increasing order.
a = append(0, EnumerateSums(t.left))
// Do the same for paths containing the right child. This needs to
// be sorted in decreasing order.
b = reverse(append(0, EnumerateSums(t.right)))
// "List match" to detect any pair of sums that works.
// This is a linear-time algorithm that takes two sorted lists --
// one increasing, the other decreasing -- and detects whether there is
// any pair of elements (one from the first list, the other from the
// second) that sum to a given value. Starting at the beginning of
// each list, we compute the current sum, and proceed to strike out any
// elements that we know cannot be part of a satisfying pair.
// If the sum of a[i] and b[j] is too small, then we know that a[i]
// cannot be part of any satisfying pair, since all remaining elements
// from b that it could be added to are at least as small as b[j], so we
// can strike it out (which we do by advancing i by 1). Similarly if
// the sum of a[i] and b[j] is too big, then we know that b[j] cannot
// be part of any satisfying pair, since all remaining elements from a
// that b[j] could be added to are at least as big as a[i], so we can
// strike it out (which we do by advancing j by 1). If we get to the
// end of either list without finding the right sum, there can be
// no satisfying pair.
i = 0
j = 0
while (i < length(a) and j < length(b)) {
if (a[i] + b[j] + t.value < target) {
i = i + 1
} else if (a[i] + b[j] + t.value > target) {
j = j + 1
} else {
print "Found! Topmost node=", t
return
}
}
// Recurse to examine the rest of the tree.
EnumerateTopmost(t.left, target)
EnumerateTopmost(t.right, target)
}
// Return a list of all sums that contain t and at most one of its children,
// in increasing order.
EnumerateSums(Tree t) {
if (t == NULL) {
// We have been called with the "child" of a leaf node.
return [] // Empty list
} else {
// Include a 0 in one of the child sum lists to stand for
// "just node t" (arbitrarily picking left here).
// Note that even if t is a leaf node, we still call ourselves on
// its "children" here -- in C/C++, a special "NULL" value represents
// these nonexistent children.
a = append(0, EnumerateSums(t.left))
b = EnumerateSums(t.right)
Add t.value to each element in a
Add t.value to each element in b
// "Ordinary" list merge that simply combines two sorted lists
// to produce a new sorted list, in linear time.
c = ListMerge(a, b)
return c
}
}
The above pseudocode only reports the topmost node in the path. The entire path can be reconstructed by having EnumerateSums() return a list of pairs (sum, goesLeft) instead of a plain list of sums, where goesLeft is a boolean that indicates whether the path used to generate that sum initially goes left from the parent node.
The above pseudocode calculates sum lists multiple times for each node: EnumerateSums(t) will be called once for each node above t in the tree, in addition to being called for t itself. It would be possible to make EnumerateSums() memoise the list of sums for each node so that it's not recomputed on subsequent calls, but actually this doesn't improve the asymptotics: only O(n) work is required to produce a list of n sums using the plain recursion, and changing this to O(1) doesn't change the overall time complexity because the entire list of sums produced by any call to EnumerateSums() must in general be read by the caller anyway, and this requires O(n) time. EDIT: As pointed out by Evgeny Kluev, EnumerateSums() actually behaves like a merge sort, being O(nlog n) when the tree is perfectly balanced and O(n^2) when it is a single path. So memoisation will in fact give an asymptotic performance improvement.
It is possible to get rid of the temporary lists of sums by rearranging EnumerateSums() into an iterator-like object that performs the list merge lazily, and can be queried to retrieve the next sum in increasing order. This would entail also creating an EnumerateSumsDown() that does the same thing but retrieves sums in decreasing order, and using this in place of reverse(append(0, EnumerateSums(t.right))). Doing this brings the space complexity of the algorithm down to O(n), where n is the number of nodes in the tree, since each iterator object requires constant space (pointers to left and right child iterator objects, plus a place to record the last sum) and there can be at most one per tree node.
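A compact Python sketch of the topmost-node idea follows. It is not the pseudocode above translated line by line: the Node class and the names down_sums and has_path_sum are assumptions made for the example, it returns a boolean instead of printing, and it does not memoise, so it keeps the repeated list building discussed above.

```python
from dataclasses import dataclass
from heapq import merge
from typing import List, Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def down_sums(t: Optional[Node]) -> List[int]:
    """Sorted sums of all downward paths that start at t (t included)."""
    if t is None:
        return []
    # The leading [0] stands for "just node t"; merge keeps the result sorted.
    below = merge([0], down_sums(t.left), down_sums(t.right))
    return [s + t.value for s in below]


def has_path_sum(t: Optional[Node], target: int) -> bool:
    """True if some path whose topmost node lies in t's subtree sums to target."""
    if t is None:
        return False
    a = list(merge([0], down_sums(t.left)))          # increasing
    b = list(merge([0], down_sums(t.right)))[::-1]   # decreasing
    i = j = 0
    while i < len(a) and j < len(b):                 # "list match" step
        s = a[i] + b[j] + t.value
        if s < target:
            i += 1
        elif s > target:
            j += 1
        else:
            return True
    return has_path_sum(t.left, target) or has_path_sum(t.right, target)


tree = Node(10, Node(5, Node(3), Node(7)), Node(15, None, Node(20)))
print(has_path_sum(tree, 37))  # prints: True  (7 + 5 + 10 + 15)
```

Using merge with a one-element list instead of prepending 0 keeps the lists sorted even if the tree holds negative values.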
I would traverse the left subtree in order and the right subtree in reverse order at the same time, much like the merge step of merge sort works. Each time, move the iterator that brings the sum closer to the target. It's O(n).
Not the fastest, but simple approach would be to use two nested depth-first searches.
Use normal depth-first search to get starting node. Use second, modified version of depth-first search to check sums for all paths, starting from this node.
Second depth-first search is different from normal depth-first search in two details:
It keeps current path sum. It adds value to the sum each time a new node is added to the path and removes value from the sum when some node is removed.
It traverses edges of the path from root to the starting node only in opposite direction (red edges on diagram). All other edges are traversed in proper direction, as usual (black edges on diagram). To traverse edges in opposite direction, it either uses "parent" pointers of the original BST (if there are any), or peeks into the stack of first depth-first search to obtain these "parent" pointers.
The time complexity of each DFS is O(N), so the total time complexity is O(N^2). Space requirements are O(N) (space for both DFS stacks). If the original BST contains "parent" pointers, space requirements are O(1) ("parent" pointers allow traversing the tree in any direction without stacks).
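The nested search can be sketched in Python as follows. The sketch assumes a Node class without parent pointers, so the first DFS records each node's parent in a dictionary (a simplification of peeking into the first DFS stack); the name path_sum_exists is made up for the example.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right


def path_sum_exists(root, target):
    """Check every simple path in the tree for the target sum, O(N^2)."""
    parent = {}                      # child -> parent, filled by the first DFS
    starts = []
    stack = [root] if root is not None else []
    while stack:                     # first DFS: collect every starting node
        n = stack.pop()
        starts.append(n)
        for c in (n.left, n.right):
            if c is not None:
                parent[c] = n
                stack.append(c)

    def walk(n, came_from, total):   # second DFS: extend the current path
        total += n.value
        if total == target:
            return True
        for m in (n.left, n.right, parent.get(n)):
            # Never step back along the edge we arrived by; in a tree this
            # is enough to keep the path simple.
            if m is not None and m is not came_from and walk(m, n, total):
                return True
        return False

    return any(walk(s, None, 0) for s in starts)


tree = Node(10, Node(5, Node(3), Node(7)), Node(15, None, Node(20)))
print(path_sum_exists(tree, 37))  # prints: True  (7 + 5 + 10 + 15)
```

Each path is visited from both of its ends here, which wastes a constant factor but does not change the quadratic bound.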
Other approach is based on ideas by j_random_hacker and robert king (maintaining lists of sums, matching them, then merging them together). It processes the tree in bottom-up manner (starting from leafs).
Use DFS to find some leaf node. Then go back and find the last branch node, that is a grand-...-grand-parent of this leaf node. This gives a chain between branch and leaf nodes. Process this chain:
match1(chain)
sum_list = sum(chain)
match1(chain):
    i = j = sum = 0
    loop:
        while (sum += chain[i]) < target:
            ++i
        while (sum -= chain[j]) > target:
            ++j
        if sum == target:
            success!
sum(chain):
    result = [0]
    sum = 0
    i = chain.length - 1
    loop:
        sum += chain[i]
        --i
        result.append(sum)
    return result
Continue DFS and search other leaf chains. When two chains, coming from the same node are found, possibly preceded by another chain (red and green chains on diagram, preceded by blue chain), process these chains:
match2(parent, sum_list1, sum_list2)
sum_list3 = merge1(parent, sum_list1, sum_list2)
if !chain3.empty:
    match1(chain3)
    match3(sum_list3, chain3)
    sum_list4 = merge2(sum_list3, chain3)
match2(parent, sum_list1, sum_list2):
    i = 0
    j = sum_list2.length - 1
    sum = target - parent.value
    loop:
        while sum > sum_list1[i] + sum_list2[j]:
            ++i
        while sum < sum_list1[i] + sum_list2[j]:
            --j
        if sum == sum_list1[i] + sum_list2[j]:
            success!
merge1(parent, sum_list1, sum_list2):
    result = [0, parent.value]
    i = j = 1
    loop:
        if sum_list1[i] < sum_list2[j]:
            result.append(parent.value + sum_list1[i])
            ++i
        else:
            result.append(parent.value + sum_list2[j])
            ++j
    return result
match3(sum_list3, chain3):
    i = sum = 0
    j = sum_list3.length - 1
    loop:
        sum += chain3[i++]
        while sum_list3[j] + sum > target:
            --j
        if sum_list3[j] + sum == target:
            success!
merge2(sum_list3, chain3):
    result = [0]
    sum = 0
    i = chain3.length - 1
    loop:
        sum += chain3[i--]
        result.append(sum)
    result.append(sum_list3[1...] + sum)
    return result
Do the same wherever any two lists of sums or a chain and a list of sums are descendants of the same node. This process may be continued until a single list of sums, belonging to root node, remains.
Are there any complexity restrictions?
As you stated: "easy if a path means root to leaf, or easy if the path means a portion of a path from root to leaf that may not include the root or the leaf".
You can reduce the problem to this statement by setting the root each time to a different node and doing the search n times.
That would be a straightforward approach, not sure if optimal.
Edit: if the tree is unidirectional, something of this kind might work (pseudocode):
findSum(tree, sum)
    if (isLeaf(tree))
        return (sum == tree->data)
    isfound = false
    for (i = 0 to sum)
        isfound |= findSum(leftSubTree, i) && findSum(rightSubTree, sum-i)
    return isfound;
Probably lots of mistakes here, but hopefully it clarifies the idea.

How to delete in a heap data structure?

I understand how to delete the root node from a max heap but is the procedure for deleting a node from the middle to remove and replace the root repeatedly until the desired node is deleted?
Is O(log n) the optimal complexity for this procedure?
Does this affect the big O complexity since other nodes must be deleted in order to delete a specific node?
Actually, you can remove an item from the middle of a heap without trouble.
The idea is to take the last item in the heap and, starting from the current position (i.e. the position that held the item you deleted), sift it up if the new item is greater than the parent of the old item. If it's not greater than the parent, then sift it down.
That's the procedure for a max heap. For a min heap, of course, you'd reverse the greater and less cases.
Finding an item in a heap is an O(n) operation, but if you already know where it is in the heap, removing it is O(log n).
I published a heap-based priority queue for DevSource a few years back. The full source is at http://www.mischel.com/pubs/priqueue.zip
Update
Several have asked if it's possible to move up after moving the last node in the heap to replace the deleted node. Consider this min-heap:
        1
    6       2
  7   8   3
If you delete the node with value 7, the value 3 replaces it:
        1
    6       2
  3   8
You now have to move it up to make a valid heap:
        1
    3       2
  6   8
The key here is that if the item you're replacing is in a different subtree than the last item in the heap, it's possible that the replacement node will be smaller than the parent of the replaced node.
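The example can be replayed with a short Python sketch. It assumes the min-heap is stored as a list in level order; remove_at is a name chosen for the illustration and is not taken from the linked source.

```python
def remove_at(heap, i):
    """Remove and return heap[i] from a list-based min-heap in O(log n)."""
    removed = heap[i]
    last = heap.pop()                 # detach the last item
    if i == len(heap):                # the removed item was the last one
        return removed
    heap[i] = last                    # drop the last item into the hole

    # Sift up while the moved item is smaller than its parent.
    while i > 0 and heap[i] < heap[(i - 1) // 2]:
        p = (i - 1) // 2
        heap[i], heap[p] = heap[p], heap[i]
        i = p

    # Sift down while a child is smaller than the moved item.
    n = len(heap)
    while True:
        smallest = i
        for c in (2 * i + 1, 2 * i + 2):
            if c < n and heap[c] < heap[smallest]:
                smallest = c
        if smallest == i:
            break
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest
    return removed


heap = [1, 6, 2, 7, 8, 3]             # the heap drawn above, level by level
remove_at(heap, 3)                    # delete the node holding 7
print(heap)                           # prints: [1, 3, 2, 6, 8]
```

At most one of the two loops does any work for a given removal, so running both unconditionally keeps the code short without changing the cost.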
The problem with removing an arbitrary element from a heap is that you cannot find it.
In a heap, looking for an arbitrary element is O(n), thus removing an element [if given by value] is O(n) as well.
If it is important for you to remove arbitrary elements from the data structure, a heap is probably not the best choice; you should consider a fully sorted data structure instead, such as a balanced BST or a skip list.
If your element is given by reference, however, it is possible to remove it in O(log n) by simply 'replacing' it with the last leaf [remember a heap is implemented as a complete binary tree, so there is a last leaf, and you know exactly where it is], removing that last leaf, and re-heapifying the relevant sub-heap.
If you have a max heap, you could implement this by assigning the item to be deleted a value larger than any other possible value (e.g. something like int.MaxValue or inf in whichever language you are using), then re-heapifying so that it becomes the new root. Then perform a regular removal of the root node.
This will cause another re-heapify, but I can't see an obvious way to avoid doing it twice. This suggests that perhaps a heap isn't appropriate for your use-case, if you need to pull nodes from the middle of it often.
(for a min heap, you can obviously use int.MinValue or -inf or whatever)
Removing an element from a known heap array position has O(log n) complexity (which is optimal for a heap). Thus, this operation has the same complexity as extracting (i.e. removing) the root element.
The basic steps for removing the i-th element (where 0<=i<n) from heap A (with n elements) are:
swap element A[i] with element A[n-1]
set n=n-1
possibly fix the heap such that the heap-property is satisfied for all elements
Which is pretty similar to how the extraction of the root element works.
Remember that the heap-property is defined in a max-heap as:
A[parent(i)] >= A[i], for 0 < i < n
Whereas in a min-heap it's:
A[parent(i)] <= A[i], for 0 < i < n
In the following we assume a max-heap to simplify the description. But everything works analogously with a min-heap.
After the swap we have to distinguish 3 cases:
new key in A[i] equals the old key - nothing changes, done
new key in A[i] is greater than the old key. Nothing changes for the sub-trees l and r of i. If previously A[i] >= A[j] was true then now A[i]+c >= A[j] must be true as well (for j in (l, r) and c>=0). But the ancestors of element i might need fixing. This fix-up procedure is basically the same as when increasing A[i].
new key in A[i] is smaller than the old key. Nothing changes for the ancestors of element i, because if the previous value already satisfied the heap property, a smaller value does so as well. But the sub-trees might now need fixing, i.e. in the same way as when extracting the maximum element (i.e. the root).
An example implementation:
void heap_remove(A, i, &n)
{
assert(i < n);
assert(is_heap(A, i));
--n;
if (i == n)
return;
bool is_gt = A[n] > A[i];
A[i] = A[n];
if (is_gt)
heapify_up(A, i);
else
heapify(A, i, n);
}
Where heapify_up() basically is the textbook increase() function - modulo writing the key:
void heapify_up(A, i)
{
while (i > 0) {
j = parent(i);
if (A[i] > A[j]) {
swap(A, i, j);
i = j;
} else {
break;
}
}
}
And heapify() is the text-book sift-down function:
void heapify(A, i, n)
{
for (;;) {
l = left(i);
r = right(i);
maxi = i;
if (l < n && A[l] > A[i])
maxi = l;
if (r < n && A[r] > A[maxi])
maxi = r;
if (maxi == i)
break;
swap(A, i, maxi);
i = maxi;
}
}
Since the heap is an (almost) complete binary tree, its height is in O(log n). Both heapify functions have to visit all tree levels, in the worst case, thus the removal by index is in O(log n).
Note that finding the element with a certain key in a heap is in O(n). Thus, removal by key value is in O(n) because of the find complexity, in general.
So how can we keep track of the array position of an element we've inserted? After all, further inserts/removals might move it around.
We can keep track by also storing a pointer to an element record next to the key, on the heap, for each element. The element record then contains a field with the current position - which thus has to be maintained by modified heap-insert and heap-swap functions. If we retain the pointer to the element record after insert, we can get the element's current position in the heap in constant time. Thus, in that way, we can also implement element removal in O(log n).
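A minimal Python sketch of this bookkeeping is shown below. It departs from the description in one respect that should be read as an assumption: instead of a pointer to an element record, each inserted key gets an integer handle, and a dictionary maps handles to current array positions. The class and method names are invented for the example.

```python
class IndexedMaxHeap:
    """Max-heap that tracks where each inserted element currently sits."""

    def __init__(self):
        self.a = []            # heap array of (key, handle) pairs
        self.pos = {}          # handle -> current index in self.a
        self.next_handle = 0

    def _swap(self, i, j):
        a = self.a
        a[i], a[j] = a[j], a[i]
        self.pos[a[i][1]] = i  # keep positions in sync on every swap
        self.pos[a[j][1]] = j

    def _up(self, i):
        while i > 0 and self.a[i][0] > self.a[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _down(self, i):
        n = len(self.a)
        while True:
            m = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.a[c][0] > self.a[m][0]:
                    m = c
            if m == i:
                return
            self._swap(i, m)
            i = m

    def insert(self, key):
        """Insert key and return a handle usable with remove()."""
        h = self.next_handle
        self.next_handle += 1
        self.a.append((key, h))
        self.pos[h] = len(self.a) - 1
        self._up(len(self.a) - 1)
        return h

    def remove(self, handle):
        """Remove the element behind handle in O(log n); return its key."""
        i = self.pos.pop(handle)       # O(1) position lookup
        key = self.a[i][0]
        last = self.a.pop()
        if i < len(self.a):            # fill the hole with the last element
            self.a[i] = last
            self.pos[last[1]] = i
            self._up(i)
            self._down(i)
        return key


heap = IndexedMaxHeap()
handles = {k: heap.insert(k) for k in [5, 9, 3, 7, 1]}
heap.remove(handles[7])
print([k for k, _ in heap.a])          # prints: [9, 5, 3, 1]
```

The extra dictionary costs O(n) space, and every swap pays two additional writes to keep it current.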
What you want to achieve is not a typical heap operation, and it seems to me that once you introduce "delete middle element" as a method, some other binary tree (for instance a red-black or AVL tree) is a better choice. Some languages ship a red-black tree implementation (for instance map and set in C++).
Otherwise the way to do middle element deletion is as proposed in rejj's answer: assign a big value (for a max heap) or a small value (for a min heap) to the element, sift it up until it is the root, and then delete it.
This approach still keeps the O(log(n)) complexity for middle element deletion, but the one you propose doesn't. It will have complexity O(n*log(n)) and is therefore not very good.
Hope that helps.

Generating uniformly random curious binary trees

A binary tree of N nodes is 'curious' if it is a binary tree whose node values are 1, 2, ..,N and which satisfy the property that
Each internal node of the tree has exactly one descendant which is greater than it.
Every number in 1,2, ..., N appears in the tree exactly once.
Example of a curious binary tree
4
/ \
5 2
/ \
1 3
Can you give an algorithm to generate a uniformly random curious binary tree of n nodes, which runs in O(n) guaranteed time?
Assume you only have access to a random number generator which can give you a (uniformly distributed) random number in the range [1, k] for any 1 <= k <= n. Assume the generator runs in O(1).
I would like to see an O(nlogn) time solution too.
Please follow the usual definition of labelled binary trees being distinct, to consider distinct curious binary trees.
There is a bijection between "curious" binary trees and standard heaps. Namely, given a heap, recursively (starting from the top) swap each internal node with its largest child. And, as I learned in StackOverflow not long ago, a heap is equivalent to a permutation of 1,2,...,N. So you should make a random permutation and turn it into a heap; or recursively make the heap in the same way that you would have made a random permutation. After that you can convert the heap to a "curious tree".
Aha, I think I've got how to create a random heap in O(N) time. (after which, use approach in Greg Kuperberg's answer to transform into "curious" binary tree.)
edit 2: Rough pseudocode for making a random min-heap directly. Max-heap is identical except the values inserted into the heap are in reverse numerical order.
struct Node {
Node left, right;
Object key;
constructor newNode() {
N = new Node;
N.left = N.right = null;
N.key = null;
return N;
}
}
function create-random-heap(RandomNumberGenerator rng, int N)
{
Node heap = Node.newNode();
// Creates a heap with an "incomplete" node containing a null, and having
// both child nodes as null.
List incompleteHeapNodes = [heap];
// use a vector/array type list to keep track of incomplete heap nodes.
for k = 1:N
{
// loop invariant: incompleteHeapNodes has k members. Order is unimportant.
int m = rng.getRandomNumber(k);
// create a random number between 0 and k-1
Node node = incompleteHeapNodes.get(m);
// pick a random node from the incomplete list,
// make it a complete node with key k.
// It is ok to do so since all of its parent nodes
// have values less than k.
node.left = Node.newNode();
node.right = Node.newNode();
node.key = k;
// Now remove this node from incompleteHeapNodes
// and add its children. (replace node with node.left,
// append node.right)
incompleteHeapNodes.set(m, node.left);
incompleteHeapNodes.append(node.right);
// All operations in this loop take O(1) time.
}
return prune-null-nodes(heap);
}
// get rid of all the incomplete nodes.
function prune-null-nodes(heap)
{
if (heap == null || heap.key == null)
return null;
heap.left = prune-null-nodes(heap.left);
heap.right = prune-null-nodes(heap.right);
return heap;
}
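For reference, here is a Python sketch of the same construction. It assumes Python's random module as the generator and uses made-up names (random_heap_tree, prune); the conversion to a "curious" tree is not included.

```python
import random


class Node:
    def __init__(self):
        self.key = None
        self.left = None
        self.right = None


def random_heap_tree(n, rng=random):
    """Build a random min-heap-ordered binary tree holding keys 1..n."""
    root = Node()
    incomplete = [root]               # placeholder nodes without a key yet
    for k in range(1, n + 1):
        # Invariant: len(incomplete) == k, so every slot is equally likely.
        m = rng.randrange(k)
        node = incomplete[m]
        node.key = k                  # all ancestors already hold smaller keys
        node.left, node.right = Node(), Node()
        incomplete[m] = node.left     # replace the filled slot ...
        incomplete.append(node.right) # ... and add the second new slot
    return prune(root)


def prune(node):
    """Drop the placeholder nodes that never received a key."""
    if node is None or node.key is None:
        return None
    node.left = prune(node.left)
    node.right = prune(node.right)
    return node


tree = random_heap_tree(7, random.Random(0))
print(tree.key)                       # prints: 1
```

The loop makes 1 * 2 * ... * n equally likely sequences of choices, and each sequence yields a different tree, which is why the result is uniform over heap-ordered trees. The recursive prune can hit Python's recursion limit on very deep trees, so an iterative version would be safer for large n.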

Finding last element of a binary heap

quoting Wikipedia:
It is perfectly acceptable to use a
traditional binary tree data structure
to implement a binary heap. There is
an issue with finding the adjacent
element on the last level on the
binary heap when adding an element
which can be resolved
algorithmically...
Any ideas on how such an algorithm might work?
I was not able to find any information about this issue, for most binary heaps are implemented using arrays.
Any help appreciated.
Recently, I have registered an OpenID account and am not able to edit my initial post nor comment answers. That's why I am responding via this answer. Sorry for this.
quoting Mitch Wheat:
@Yse: is your question "How do I find
the last element of a binary heap"?
Yes, it is.
Or to be more precise, my question is: "How do I find the last element of a non-array-based binary heap?".
quoting Suppressingfire:
Is there some context in which you're
asking this question? (i.e., is there
some concrete problem you're trying to
solve?)
As stated above, I would like to know a good way to "find the last element of a non-array-based binary heap" which is necessary for insertion and deletion of nodes.
quoting Roy:
It seems most understandable to me to
just use a normal binary tree
structure (using a pRoot and Node
defined as [data, pLeftChild,
pRightChild]) and add two additional
pointers (pInsertionNode and
pLastNode). pInsertionNode and
pLastNode will both be updated during
the insertion and deletion subroutines
to keep them current when the data
within the structure changes. This
gives O(1) access to both insertion
point and last node of the structure.
Yes, this should work. If I am not mistaken, it could be a little bit tricky to find the insertion node and the last node when their locations change to another subtree due to a deletion/insertion. But I'll give this a try.
quoting Zach Scrivena:
How about performing a depth-first
search...
Yes, this would be a good approach. I'll try that out, too.
Still, I am wondering if there is a way to "calculate" the locations of the last node and the insertion point. The number of levels of a binary heap with N nodes can be calculated by taking the log (base 2) of the smallest power of two that is larger than N. Perhaps it is possible to calculate the number of nodes on the deepest level, too. Then it might be possible to determine how the heap has to be traversed to reach the insertion point or the node for deletion.
Basically, the statement quoted refers to the problem of resolving the location for insertion and deletion of data elements into and from the heap. In order to maintain "the shape property" of a binary heap, the lowest level of the heap must always be filled from left to right leaving no empty nodes. To maintain the average O(1) insertion and deletion times for the binary heap, you must be able to determine the location for the next insertion and the location of the last node on the lowest level to use for deletion of the root node, both in constant time.
For a binary heap stored in an array (with its implicit, compacted data structure as explained in the Wikipedia entry), this is easy. Just insert the newest data member at the end of the array and then "bubble" it into position (following the heap rules). Or replace the root with the last element in the array "bubbling down" for deletions. For heaps in array storage, the number of elements in the heap is an implicit pointer to where the next data element is to be inserted and where to find the last element to use for deletion.
For a binary heap stored in a tree structure, this information is not as obvious, but because it's a complete binary tree, it can be calculated. For example, in a complete binary tree with 4 elements, the point of insertion will always be the right child of the left child of the root node. The node to use for deletion will always be the left child of the left child of the root node. And for any given arbitrary tree size, the tree will always have a specific shape with well defined insertion and deletion points. Because the tree is a "complete binary tree" with a specific structure for any given size, it is very possible to calculate the location of insertion/deletion in O(1) time. However, the catch is that even when you know where it is structurally, you have no idea where the node will be in memory. So, you have to traverse the tree to get to the given node which is an O(log n) process making all inserts and deletions a minimum of O(log n), breaking the usually desired O(1) behavior. Any search ("depth-first", or some other) will be at least O(log n) as well because of the traversal issue noted and usually O(n) because of the random nature of the semi-sorted heap.
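One common way to do this calculation, sketched here in Python with names invented for the example, uses the binary representation of the node's 1-based level-order number: after dropping the leading 1 bit, each remaining bit is one step from the root, 0 for left and 1 for right.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None


def path_to(n):
    """Steps from the root to the n-th node (1-based, level order)."""
    # bin(n) looks like '0b101'; skip '0b' and the leading 1 bit.
    return ['R' if bit == '1' else 'L' for bit in bin(n)[3:]]


def node_at(root, n):
    """Follow the computed path; takes O(log n) pointer hops."""
    node = root
    for step in path_to(n):
        node = node.left if step == 'L' else node.right
    return node


# A complete tree with 4 elements, numbered in level order.
root = Node(1)
root.left, root.right = Node(2), Node(3)
root.left.left = Node(4)

print(path_to(4))   # prints: ['L', 'L']  (last node)
print(path_to(5))   # prints: ['L', 'R']  (next insertion point)
```

This matches the 4-element example: the last node is the left child of the left child, and the insertion point is the right child of the left child. The path itself is cheap to compute, but following it still costs one hop per level.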
The trick is to be able to both calculate and reference those insertion/deletion points in constant time, either by augmenting the data structure ("threading" the tree, as mentioned in the Wikipedia article) or by using additional pointers.
The implementation which seems to me to be the easiest to understand, with low memory and extra coding overhead, is to just use a normal simple binary tree structure (using a pRoot and Node defined as [data, pParent, pLeftChild, pRightChild]) and add two additional pointers (pInsert and pLastNode). pInsert and pLastNode will both be updated during the insertion and deletion subroutines to keep them current when the data within the structure changes. This implementation gives O(1) access to both insertion point and last node of the structure and should allow preservation of overall O(1) behavior in both insertion and deletions. The cost of the implementation is two extra pointers and some minor extra code in the insertion/deletion subroutines (aka, minimal).
EDIT: added pseudocode for an O(1) insert()
Here is pseudo code for an insert subroutine which is O(1), on average:
define Node = [T data, *pParent, *pLeft, *pRight]
void insert(T data)
{
    do_insertion( data ); // do insertion, update count of data items in tree
    // assume: pInsert points to the node of the tree where the insertion just took place
    // (aka, either shuffle only data during the insertion or keep pInsert updated during the bubble process)
    int N = this->CountOfDataItems + 1; // position (1-based, level order) of the next empty node
    // note: CountOfDataItems will always be > 0 (and pRoot != null) after an insertion
    p = new Node( <null>, null, null, null ); // new empty node for the next insertion
    // update pInsert (three cases to handle)
    if ( int(log2(N)) == log2(N) )
    {   // #1 - N is an exact power of two
        // O(log2(N))
        // tree is currently a full complete binary tree ("perfect")
        // ... must start a new lower level
        // traverse from pRoot down tree thru each pLeft until empty pLeft is found for insertion
        pInsert = pRoot;
        while (pInsert->pLeft != null) { pInsert = pInsert->pLeft; } // about log2(N) iterations
        p->pParent = pInsert;
        pInsert->pLeft = p;
    }
    else if ( isOdd(N) )
    {   // #2 - N is odd
        // O(1): pInsert is a left child, the new node becomes its right sibling
        p->pParent = pInsert->pParent;
        pInsert->pParent->pRight = p;
    }
    else
    {   // #3 - N is even (and NOT a power of 2)
        // O(1) on average: pInsert is a right child, so the new node becomes the
        // left child of the node that follows pInsert's parent on the same level
        q = pInsert->pParent;
        while (q == q->pParent->pRight) { q = q->pParent; } // climb while q is a right child
        q = q->pParent->pRight;                             // step over to the right sibling
        while (q->pLeft != null) { q = q->pLeft; }          // go back down along the left edge
        p->pParent = q;
        q->pLeft = p;
    }
    pInsert = p;
    // update pLastNode
    // ... [similar process]
}
So, insert(T) is O(1) on average: the pointer updates take constant time on average, and the O(log N) traversal needed to start a new level happens only when N reaches a power of two, i.e. about log N times over N insertions (assuming no deletions). The addition of another pointer (pLeftmostLeaf) could make that case O(1) as well and avoids the possible pathological case of alternating insertion & deletion in a full complete binary tree. (Adding pLeftmostLeaf is left as an exercise [it's fairly easy].)
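For anyone who wants to experiment with the pointer bookkeeping, here is a rough Python sketch of the same three cases. It only maintains the shape of the tree and leaves the bubbling out, assuming the bubbling swaps data rather than nodes. The names `CompleteTree`, `append`, `last` and `level_order` are invented for this sketch and are not from the pseudocode above.

```python
from collections import deque


class Node:
    def __init__(self, data, parent=None):
        self.data = data
        self.parent = parent
        self.left = None
        self.right = None


class CompleteTree:
    """Keeps a linked complete binary tree plus a pointer to its last node."""

    def __init__(self):
        self.root = None
        self.last = None
        self.count = 0

    def append(self, data):
        self.count += 1
        n = self.count  # 1-based level-order position of the new node
        if self.root is None:
            self.root = self.last = Node(data)
            return self.last
        if n & (n - 1) == 0:
            # n is a power of two: start a new level below the leftmost leaf
            parent = self.root
            while parent.left is not None:
                parent = parent.left
            node = Node(data, parent)
            parent.left = node
        elif n % 2 == 1:
            # n is odd: new node is the right sibling of the last node
            parent = self.last.parent
            node = Node(data, parent)
            parent.right = node
        else:
            # n is even: climb until we come up from a left child,
            # step to the right sibling, then descend along the left edge
            q = self.last.parent
            while q is q.parent.right:
                q = q.parent
            q = q.parent.right
            while q.left is not None:
                q = q.left
            node = Node(data, q)
            q.left = node
        self.last = node
        return node


def level_order(root):
    """Return the data of all nodes in breadth-first order (for checking)."""
    out, queue = [], deque([root] if root else [])
    while queue:
        node = queue.popleft()
        out.append(node.data)
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return out
```

If the shape is maintained correctly, appending 1, 2, 3, ... must give back the same sequence in level order, which is an easy way to check the three cases.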
My first time to participate in stack overflow.
Yes, the above answer by Zach Scrivena (god I don't know how to properly refer to other people, sorry) is right. What I want to add is a simplified way if we are given the count of nodes.
The basic idea is:
Let N be the position of the target node in this complete binary tree (for an insertion, that is the current node count plus one). Compute "N % 2" and push the result onto a stack, then replace N with the quotient N / 2 (integer division). Continue until N == 1. Then pop the results out. A result of 1 means right, 0 means left. The sequence is the route from the root to the target position.
Example:
The tree now has 10 nodes and I want to insert another node at position 11. How do I route to it?
11 % 2 = 1 --> right (the quotient is 5, and push right into stack)
5 % 2 = 1 --> right (the quotient is 2, and push right into stack)
2 % 2 = 0 --> left (the quotient is 1, and push left into stack. End)
Then pop the stack: left -> right -> right. This is the path from the root.
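The same calculation as a small Python sketch (the function name `path_by_remainders` is just for illustration):

```python
def path_by_remainders(position):
    """Return the route from the root to the given 1-based position."""
    stack = []
    n = position
    while n > 1:
        # remainder 1 means the node is a right child, 0 means a left child
        stack.append("right" if n % 2 == 1 else "left")
        n //= 2
    path = []
    while stack:
        path.append(stack.pop())  # popping reverses the order: root first
    return path


print(path_by_remainders(11))  # ['left', 'right', 'right']
```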
You could use the binary representation of the size of the Binary Heap to find the location of the last node in O(log N). The size could be stored and incremented, which would take O(1) time. The fundamental concept behind this is the structure of the binary tree.
Suppose our heap size is 7. The binary representation of 7 is "111". Now, remember to always omit the first bit. So, now we are left with "11". Read from left to right. The first bit is '1', so go to the right child of the root node. Then the string left is "1", and its first bit is '1', so again go to the right child of the current node. As you no longer have bits to process, you have reached the last node. So, the process is: convert the size of the heap into bits and omit the first bit. Then, for each remaining bit from left to right, go to the right child of the current node if the bit is '1', and to the left child if it is '0'.
As you always go to the very end of the binary tree, this operation always takes O(log N) time. This is a simple and accurate procedure to find the last node.
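A small Python sketch of that walk on a linked tree. The helper `build_complete` exists only to create a test tree, and both function names are made up for this example.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def build_complete(values):
    """Link a list of values into a complete binary tree (helper for the demo)."""
    nodes = [Node(v) for v in values]
    for i, node in enumerate(nodes):
        if 2 * i + 1 < len(nodes):
            node.left = nodes[2 * i + 1]
        if 2 * i + 2 < len(nodes):
            node.right = nodes[2 * i + 2]
    return nodes[0] if nodes else None


def find_last(root, size):
    """Follow the bits of size (minus the leading 1) down from the root."""
    node = root
    # bin(7) == '0b111'; slicing from index 3 drops '0b' and the first bit
    for bit in bin(size)[3:]:
        node = node.right if bit == "1" else node.left
    return node
```

For a heap of size 7 the remaining bits are "11", so the walk goes right twice and ends at the seventh node.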
You may not understand it in the first reading. Try working this method on the paper for different values of Binary Heap, I'm sure you'll get the intuition behind it. I'm sure this knowledge is enough to solve your problem, if you want more explanation with figures, you can refer to my blog.
Hope my answer has helped you, if it did, let me know...! ☺
How about performing a depth-first search, visiting the left child before the right child, to determine the height of the tree. Thereafter, the first leaf you encounter with a shorter depth, or a parent with a missing child would indicate where you should place the new node before "bubbling up".
The depth-first search (DFS) approach above doesn't assume that you know the total number of nodes in the tree. If this information is available, then we can "zoom-in" quickly to the desired place, by making use of the properties of complete binary trees:
Let N be the total number of nodes in the tree, and H be the height of the tree.
Some values of (N,H) are (1,0), (2,1), (3,1), (4,2), ..., (7,2), (8, 3).
The general formula relating the two is H = ceil[log2(N+1)] - 1.
Now, given only N, we want to traverse from the root to the position for the new node, in the least number of steps, i.e. without any "backtracking".
We first compute the total number of nodes M in a perfect binary tree of height H = ceil[log2(N+1)] - 1, which is M = 2^(H+1) - 1.
If N == M, then our tree is perfect, and the new node should be added in a new level. This means that we can simply perform a DFS (left before right) until we hit the first leaf; the new node becomes the left child of this leaf. End of story.
However, if N < M, then there are still vacancies in the last level of our tree, and the new node should be added to the leftmost vacant spot.
The number of nodes that are already at the last level of our tree is just (N - 2^H + 1).
This means that the new node takes spot X = (N - 2^H + 2) from the left, at the last level.
Now, to get there from the root, you will need to make the correct turns (L vs R) at each level so that you end up at spot X at the last level. In practice, you would determine the turns with a little computation at each level. However, I think the following table shows the big picture and the relevant patterns without getting mired in the arithmetic (you may recognize this as a form of arithmetic coding for a uniform distribution):
0 0 0 0 0 X 0 0 <--- represents the last level in our tree, X marks the spot!
          ^
L L L L R R R R <--- at level 0, proceed to the R child
L L R R L L R R <--- at level 1, proceed to the L child
L R L R L R L R <--- at level 2, proceed to the R child
          ^ (which is the position of the new node)
          this column tells us
          if we should proceed to the L or R child at each level
EDIT: Added a description on how to get to the new node in the shortest number of steps assuming that we know the total number of nodes in the tree.
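The "little computation at each level" can be sketched in Python as follows. This is one possible way to do it, using the formulas for H, M and X given above; the function name `insertion_turns` is invented. For N >= 1, `n.bit_length() - 1` is the same value as ceil[log2(N+1)] - 1 without any floating-point rounding.

```python
def insertion_turns(n):
    """Return the L/R turns from the root to the slot for the (n + 1)-th node."""
    h = n.bit_length() - 1       # H = ceil(log2(N + 1)) - 1
    m = 2 ** (h + 1) - 1         # M = nodes in a perfect tree of height H
    if n == m:
        # tree is perfect: new level, keep left all the way down
        return ["L"] * (h + 1)
    x = n - 2 ** h + 2           # spot X from the left on the last level
    offset = x - 1               # 0-based spot
    turns = []
    for level in range(h):
        half = 2 ** (h - 1 - level)  # last-level spots under each child here
        if offset >= half:
            turns.append("R")
            offset -= half
        else:
            turns.append("L")
    return turns


print(insertion_turns(12))  # ['R', 'L', 'R']
```

With N = 12 the last level has 8 spots and X = 6, which is the situation drawn in the table, and the turns come out as R, L, R.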
Solution in case you don't have a reference to the parent.
To find the right place for the next node you have 3 cases to handle, where N is the node count including the new node:
case (1) The last level is complete (N is a power of two), so a new level is started
case (2) N is even
case (3) N is odd
Insert:
void Insert(Node root,Node n)
{
    Node parent = findRequiredParentToInsertNewNode(root);
    if(parent.left == null)
        parent.left = n;
    else
        parent.right = n;
}
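Find the parent under which the new node has to be inserted
Node findRequiredParentToInsertNewNode(Node root){
    // assumption: NodeCount is the number of nodes currently in the tree
    int N = NodeCount + 1; // node count including the new node
    Node last = findLastNode(root);
    //Case 1: N is a power of two, the new node starts a new level
    if((N & (N - 1)) == 0){
        while(root.left != null)
            root = root.left;
        return root;
    }
    //Case 2: N is even, the new node is a left child
    else if(Even(N)){
        // the last node is a right child: climb until we come up from a left child
        Node child = findParentOfLastNode(root, last);
        Node n = findParentOfLastNode(root, child);
        while(n.right == child){
            child = n;
            n = findParentOfLastNode(root, child);
        }
        // step into the right subtree and go down its left edge
        n = n.right;
        while(n.left != null)
            n = n.left;
        return n;
    }
    //Case 3: N is odd, the new node is the right sibling of the last node
    else{
        return findParentOfLastNode(root, last);
    }
}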
To find the last node you need to perform a BFS (breadth first search) and get the last element in the queue
Node findLastNode(Node root)
{
    if (root.left == null)
        return root;
    Queue q = new Queue();
    q.enqueue(root);
    Node n = null;
    // the last node dequeued is the last node in level order
    while(!q.isEmpty()){
        n = q.dequeue();
        if ( n.left != null )
            q.enqueue(n.left);
        if ( n.right != null )
            q.enqueue(n.right);
    }
    return n;
}
Find the parent of the last node, so that its link to the last node can be set to null when the last node replaces the root during removal
Node findParentOfLastNode(Node root ,Node lastNode)
{
    if(root == null)
        return root;
    if( root.left == lastNode || root.right == lastNode )
        return root;
    // search both subtrees
    Node n1 = findParentOfLastNode(root.left, lastNode);
    Node n2 = findParentOfLastNode(root.right, lastNode);
    return n1 != null ? n1 : n2;
}
I know this is an old thread, but I was looking for an answer to the same question. I could not afford an O(log n) solution, as I had to find the last node thousands of times in a few seconds. I did have an O(log n) algorithm, but my program was crawling because of the number of times it performed this operation. So after much thought I finally found a fix for this. Not sure if anybody thinks this is interesting.
This solution is O(1) for search. For insertion it is definitely less than O(log n), although I cannot say it is O(1).
Just wanted to add that if there is interest, I can provide my solution as well.
The solution is to add the nodes in the binary heap to a queue. Every queue node has front and back pointers. We keep adding nodes to the end of this queue from left to right until we reach the last node in the binary heap. At this point, the last node in the binary heap will be at the rear of the queue.
Every time we need to find the last node, we dequeue from the rear, and the second-to-last now becomes the last node in the tree.
When we want to insert, we search backwards from the rear for the first node where we can insert and put the new node there. It is not exactly O(1) but reduces the running time dramatically.
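Here is a rough Python sketch of how this could look. It is my reading of the description, not the original code: a plain list stands in for the doubly linked queue, and the names `QueuedTree`, `insert`, `last` and `remove_last` are invented for the example. The backward scan relies on the fact that, in level order, the nodes with a free child slot are all at the rear.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


class QueuedTree:
    """Linked complete binary tree whose nodes are also kept in level order."""

    def __init__(self):
        self.nodes = []  # front = root, rear = last node of the tree

    def last(self):
        return self.nodes[-1] if self.nodes else None

    def insert(self, value):
        node = Node(value)
        if self.nodes:
            # scan backwards while the previous node still has a free slot
            i = len(self.nodes) - 1
            while i > 0 and self.nodes[i - 1].right is None:
                i -= 1
            parent = self.nodes[i]
            if parent.left is None:
                parent.left = node
            else:
                parent.right = node
        self.nodes.append(node)
        return node

    def remove_last(self):
        node = self.nodes.pop()  # dequeue from the rear
        if self.nodes:
            # shortcut (assumption of this sketch): in level order the parent
            # of the node that sat at index k is at index (k - 1) // 2
            parent = self.nodes[(len(self.nodes) - 1) // 2]
            if parent.right is node:
                parent.right = None
            else:
                parent.left = None
        return node
```

The same index shortcut used in `remove_last` could replace the backward scan in `insert`; the scan is kept only to mirror the description above.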

Resources