Binary Tree array list representation - algorithm

I have been doing some research on binary trees and their array-list representation. I am struggling to understand why the worst-case space complexity is O(2^n). Specifically, the book states that the space usage is O(N) (N = array size), which is O(2^n) in the worst case. I would have thought it would be 2n in the worst case, since each node has two children (indexes), not O(2^n), where n = number of elements.
As an example, if I had a binary tree with 7 nodes, then the space would be 2n = 14, not 2^n = 128.

This is the heap implementation on an array, where
A[1..n]
left_child(i) = A[2*i]
right_child(i) = A[2*i+1]
parent(i) = A[floor(i/2)]
Now, coming to space, think intuitively:
when you insert the first element, n=1, it goes to location A[1]; similarly,
n=2 #A[2] left_child(1)
n=3 #A[3] right_child(1)
n=4 #A[4] left_child(2)
n=5 #A[5] right_child(2)
You see, the nth element goes into A[n], so the space complexity is O(n).
When you code it, you just plug the element to be inserted into the end, say at A[n+1], and say that it's a child of floor((n+1)/2).
Refer: http://en.wikipedia.org/wiki/Binary_heap#Heap_implementation
A heap is a nearly complete tree, so the total number of elements satisfies 2^h - 1 < n <= 2^(h+1) - 1 (where h is the height), and n is the array length you will need.
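To make the index arithmetic concrete, here is a minimal Python sketch of an array-backed min-heap (hypothetical code; slot 0 is left unused so the indices match A[1..n]):

heap = [None]  # slot 0 unused so that indices match A[1..n]

def insert(value):
    heap.append(value)                        # new element lands in A[n+1]
    i = len(heap) - 1
    while i > 1 and heap[i // 2] > heap[i]:   # sift up: parent(i) = floor(i/2)
        heap[i // 2], heap[i] = heap[i], heap[i // 2]
        i //= 2

for v in [5, 3, 8, 1]:
    insert(v)
print(heap[1:])  # n elements occupy exactly n slots: O(n) space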

The worst-case space complexity of a binary tree is O(n) (not the O(2^n) in your question), but using arrays to represent binary trees can save the space of pointers if it's a nearly complete binary tree.
See http://en.wikipedia.org/wiki/Binary_tree#Arrays

I think this refers to storing arbitrary binary trees in an array representation, which is normally used for complete or nearly complete binary trees, notably in the implementation of heaps.
In this representation, the root is stored at index 0 in the array, and for any node with index n, its left and right children are stored at indices 2n+1 and 2n+2, respectively.
If you have a degenerate tree where no node has a right child (the tree is effectively a linked list), then the first items will be stored at indices 0, 1, 3, 7, 15, 31, .... In general, the nth item of this list (counting from 0) will be stored at index 2^n - 1, so in this case the array representation requires Θ(2^n) space.
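A quick way to see the Θ(2^n) blow-up concretely (a hypothetical snippet that follows only left-child links in the 0-based scheme above):

i = 0
for depth in range(6):
    print(depth, i)    # the item at this depth sits at index 2^depth - 1
    i = 2 * i + 1      # left child
# prints indices 0, 1, 3, 7, 15, 31: an array holding n chained items
# must have length about 2^(n-1), i.e. Θ(2^n)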


Smallest missing number at any point in time in a stream of positive numbers

We are processing a stream of positive integers. At any point in time, we can be asked a query to which the answer is the smallest positive number that we have not seen yet.
One can assume two APIs.
void processNext(int val)
int getSmallestNotSeen()
We can assume the numbers to be bounded by the range [1,10^6]. Let this range be N.
Here is my solution.
Let's take an array of size 10^6. Whenever processNext(val) is called, we mark array[val] as 1. We build a sum segment tree on this array, so processNext is a point update in the segment tree. Whenever getSmallestNotSeen() is called, I find the smallest index j such that sum[1..j] is less than j, using binary search.
processNext(val) -> O(logN) (point update)
getSmallestNotSeen() -> O((logN)^2)
I was wondering whether there is something more optimal, or whether the above solution can be improved.
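For reference, here is a sketch of the approach described above, with a Fenwick (binary indexed) tree standing in for the sum segment tree; both support point update and prefix sum in O(logN). Hypothetical code, names illustrative:

class SmallestNotSeen:
    def __init__(self, n=10**6):
        self.n = n
        self.tree = [0] * (n + 1)     # Fenwick tree over [1..n]
        self.seen = [False] * (n + 1)

    def _prefix_sum(self, i):         # how many values in [1..i] were seen
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def process_next(self, val):      # point update, O(logN)
        if not self.seen[val]:
            self.seen[val] = True
            i = val
            while i <= self.n:
                self.tree[i] += 1
                i += i & -i

    def get_smallest_not_seen(self):  # binary search: smallest j with sum[1..j] < j
        lo, hi = 1, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            if self._prefix_sum(mid) < mid:
                hi = mid
            else:
                lo = mid + 1
        return lo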
Make a map of id -> node (nodes of a doubly-linked list) and initialize it with 10^6 nodes, each pointing to its neighbors. Initialize the min to one.
processNext(val): check if the node exists. If it does, delete it and point its neighbors at each other. If the node you delete has no left neighbor (i.e. was smallest), update the min to be the right neighbor.
getSmallestNotSeen(): return the min
The preprocessing is linear time and linear memory. Everything after that is constant time.
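A minimal Python sketch of this answer (hypothetical; prev/next arrays over the bounded range play the role of the map of doubly-linked-list nodes):

N = 10**6
nxt = [i + 1 for i in range(N + 2)]   # node i's right neighbour (N+1 is a sentinel)
prv = [i - 1 for i in range(N + 2)]   # node i's left neighbour (0 is a sentinel)
present = [True] * (N + 2)
minimum = 1

def process_next(val):
    global minimum
    if not present[val]:              # duplicate: node already deleted
        return
    present[val] = False
    nxt[prv[val]] = nxt[val]          # point the neighbours at each other
    prv[nxt[val]] = prv[val]
    if val == minimum:                # the deleted node was the smallest unseen
        minimum = nxt[val]            # so the min moves to its right neighbour

def get_smallest_not_seen():
    return minimum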
In case the number of processNext calls (i.e. the length of the stream) is fairly small compared with the range N, space usage can be limited by storing consecutive ranges of numbers instead of all possible individual numbers. This is also interesting when N could be a much larger range, like [1, 2^64 - 1].
Data structure
I would suggest a binary search tree with such [start, end] ranges as elements, and self-balancing (like AVL, red-black, ...).
Algorithm
Initialise the tree with one (root) node: [1, Infinity]
Whenever a new value val arrives via processNext, find the range [start, end] that includes val, using binary search.
If the range has size 1 (and thus only contains val), perform a deletion of that node (according to the tree rules)
Else if val is a bounding value of the range, then just update the range in that node, excluding val.
Otherwise split the range into two. Update the node with one of the two ranges (decide by the balance information) and let the other range sift down to a new leaf (and rebalance if needed).
In the tree, maintain a reference to the node having the least start value. Only when this node gets deleted during processNext will it need a traversal up or down the tree to find the next (in-order) node. When the node splits (see above) and it is decided to put the lower part in a new leaf, the reference needs to be updated to that leaf.
The getSmallestNotSeen function returns the start value from that least-range node.
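A simplified Python sketch of the range idea (hypothetical; a sorted list with bisect stands in for the self-balancing tree, so updates here are O(S) worst case instead of O(log S), but the case analysis is the same):

import bisect

ranges = [[1, float('inf')]]          # disjoint [start, end] ranges of unseen numbers

def process_next(val):
    i = bisect.bisect_right(ranges, [val, float('inf')]) - 1
    if i < 0:
        return
    start, end = ranges[i]
    if not (start <= val <= end):
        return                        # val was already seen
    if start == end:
        ranges.pop(i)                 # size-1 range: delete the node
    elif val == start:
        ranges[i][0] = val + 1        # val is a bounding value: shrink the range
    elif val == end:
        ranges[i][1] = val - 1
    else:
        ranges[i] = [start, val - 1]  # split the range into two
        ranges.insert(i + 1, [val + 1, end])

def get_smallest_not_seen():
    return ranges[0][0]               # least start value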
Time & Space Complexity
The space complexity is O(S), where S is the length of the stream
The time complexity of processNext is O(log(S))
The time complexity of getSmallestNotSeen is O(1)
The best-case space and time complexity is O(1). Such a best case occurs when the stream has consecutive integers (increasing or decreasing).
bool array[10^6] = {false, false, ... }
int min = 1

void processNext(int val) {
    array[val] = true   // A
    while (array[min])  // B
        min++           // C
}

int getSmallestNotSeen() {
    return min
}
Time complexity:
processNext: amortised O(1)
getSmallestNotSeen: O(1)
Analysis:
If processNext is invoked k times and n is the highest value stored in min (which could be returned by getSmallestNotSeen), then:
the line A will be executed exactly k times,
the line B will be executed exactly k + n times, and
the line C will be executed exactly n times.
Additionally, n will never be greater than k, because for min to reach n there needs to be a contiguous run of n true values in the array, and there can be only k true values in the array in total. Therefore, line B can be executed at most 2*k times and line C at most k times.
Space complexity:
Instead of an array, it is possible to use a HashMap without any additional changes in the pseudocode (non-existent keys in the HashMap should evaluate to false). Then the space complexity is O(k). Additionally, you can prune keys smaller than min, thus saving space in some cases:
HashMap<int,bool> map
int min = 1

void processNext(int val) {
    if (val < min)
        return
    map.put(val, true)
    while (map.get(min) == true) {
        map.remove(min)
        min++
    }
}

int getSmallestNotSeen() {
    return min
}
This pruning technique might be most effective if the stream values increase steadily.
Your solution takes O(N) space to hold the array and the sum segment tree, and O(N) time to initialise them; then O(log N) and O(log² N) for the two queries. It seems pretty clear that you can't do better than O(N) space in the long run to keep track of which numbers are "seen" so far, if there are going to be a lot of queries.
However, a different data structure can improve on the query times. Here are three ideas:
Self-balancing binary search tree
Initialise the tree to contain every number from 1 to N; this can be done in O(N) time by building the tree from the leaves up; the leaves have all the odd numbers, then they're joined by all the numbers which are 2 mod 4, then those are joined by the numbers which are 4 mod 8, and so on. The tree takes O(N) space.
processNext is implemented by removing the number from the tree in O(log N) time.
getSmallestNotSeen is implemented by finding the left-most node in O(log N) time.
This is an improvement for getSmallestNotSeen (O(log N) instead of O(log² N)), while processNext costs O(log N) in both solutions.
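For the O(N) initialisation, a simple recursive construction also gives a balanced tree over 1..N (a hypothetical sketch; deletion for processNext is the usual self-balancing-tree removal and is omitted):

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def build(lo, hi):
    # balanced BST over the integers lo..hi, built in O(hi - lo + 1) time
    if lo > hi:
        return None
    mid = (lo + hi) // 2
    return Node(mid, build(lo, mid - 1), build(mid + 1, hi))

def smallest(root):
    while root.left is not None:   # getSmallestNotSeen: left-most node, O(log N)
        root = root.left
    return root.value

root = build(1, 15)
print(smallest(root))  # 1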
Doubly-linked list
Initialise a doubly-linked list containing the numbers 1 to N in order, and create an array of size N holding pointers to each node. This takes O(N) space and is done in O(N) time. Initialise a variable holding a cached minimum value to be 1.
processNext is implemented by looking up the corresponding list node in the array, and deleting it from the list. If the deleted node has no predecessor, update the cached minimum value to be the value held by the successor node. This is O(1) time.
getSmallestNotSeen is implemented by returning the cached minimum, in O(1) time.
This is also an improvement, and is strictly better asymptotically, although the constants involved might be higher; there's a lot of overhead to hold an array of size N and also a doubly-linked list of size N.
Hash-set
The time requirements for the other solutions are largely determined by their initialisation stages, which take O(N) time. Initialising an empty hash-set, on the other hand, is O(1). As before, we also initialise a variable holding a current minimum value to be 1.
processNext is implemented by inserting the number into the set, in O(1) amortised time.
getSmallestNotSeen updates the current minimum by incrementing it until it's no longer in the set, and then returns it. Membership tests on a hash-set are O(1), and the number of increments over all queries is limited by the number of times processNext is called, so this is also O(1) amortised time.
Asymptotically, this solution takes O(1) time for initialisation and queries, and it uses O(min(Q,N)) space where Q is the number of queries, while the other solutions use O(N) space regardless.
I think it should be straightforward to prove that O(min(Q,N)) space is asymptotically optimal, so the hash-set turns out to be the best option. Credit goes to Dave for combining the hash-set with a current-minimum variable to do getSmallestNotSeen in O(1) amortised time.
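The winning hash-set idea fits in a few lines (a minimal Python sketch, with the pruning trick from the earlier answer included):

class SmallestUnseen:
    def __init__(self):
        self.seen = set()
        self.minimum = 1

    def process_next(self, val):       # O(1) amortised
        if val >= self.minimum:
            self.seen.add(val)

    def get_smallest_not_seen(self):   # O(1) amortised
        while self.minimum in self.seen:
            self.seen.discard(self.minimum)  # prune: values below min are never needed again
            self.minimum += 1
        return self.minimum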

Lower bound of merging k sorted arrays of size n

As the title suggests, I am wondering what the proof of the lower bound for merging k sorted arrays of size n is. I know that the bound is O(kn log k), but how was it derived? I tried comparing it to sorting an array of p elements using a decision tree, but I don't see how to adapt that proof.
This is fairly easy to prove; try to think about it in a merge-sort way. Merge-sorting an array of size K*N takes O(KN*log(K*N)).
But we don't have to go all the way down to leaves of size 1: we know that when a subarray has size N it is already sorted. For simplicity we will assume K is a power of 2.
How many times do we have to divide by 2 to reach leaves of size N?
log(K) times!
So you have log(K) steps, and the merging at each step costs KN. Hence, the time complexity is O(NK log K).
Proof: Let's assume it is not a lower bound and we could do better. Then for any unknown array of size N*K we could split it in 2 until we reach sub-arrays of size N, and merge-sort each of those K arrays of size N in N*log(N) time, for K*N*log(N) time in total.
After having the K arrays of size N sorted, we merge them into a bigger array of size N*K, paying less than O(NK log K), as we assumed this is not the lower bound.
In the end we would have sorted an unknown array of size N*K with complexity less than N*K*log(N*K), which is not possible in the comparison model.
Hence, you can't do better than O(NK log K) when merging the K sorted arrays of size N.
Possible implementation.
Let's create a heap data structure that stores pairs (element, arrayIndex) ordered by element. Then:
Add the first element of each array with the corresponding array index to this heap.
On each step, remove the top (lowest) pair p from the heap, add p.element to the result, and insert to the heap the pair (next, p.arrayIndex) with the next element from the array with p.arrayIndex index (if it is not empty).
For tracking the 'next' element you need an array of k indices/pointers/iterators, each pointing to the next element of the corresponding array.
There will be at most k elements in the heap at any time, thus the insert/remove operations of the heap will have O(log(k)) complexity. Every element will be inserted and removed once from the heap. The number of elements is n*k. Overall complexity is O(n*k*log(k)).
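A minimal Python sketch of that implementation (hypothetical code; the tuples (element, arrayIndex, position) replace the separate index array):

import heapq

def merge_k_sorted(arrays):
    heap = [(arr[0], idx, 0) for idx, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)               # at most k pairs in the heap at any time
    result = []
    while heap:
        element, idx, pos = heapq.heappop(heap)    # O(log k)
        result.append(element)
        if pos + 1 < len(arrays[idx]):             # next element of the same array
            heapq.heappush(heap, (arrays[idx][pos + 1], idx, pos + 1))
    return result

print(merge_k_sorted([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))  # [1, 2, 3, ..., 9]

Python's standard library also provides heapq.merge, which implements the same idea.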
Create a min-heap of size k which stores the next item from each of the k arrays. Each node also stores which array it came from. Build your sorted array by repeatedly adding the min from the heap to final_sorted_array, then adding to the heap the next element from the array that value came from.
Removing the min element of the heap is O(log k). You have a total of NK elements, so you do this NK times. Final result: O(NK log k).

Cracking the Coding Interview 6th Edition: 10.10. Rank from Stream

The problem statement is as follows:
Imagine you are reading in a stream of integers. Periodically, you wish to be able to look up the rank of a number x (the number of values less than or equal to x). Implement the data structures and algorithms to support these operations. That is, implement the method track(int x), which is called when each number is generated, and the method getRankOfNumber(int x), which returns the number of values less than or equal to x (not including x itself).
EXAMPLE: Stream (in order of appearance): 5, 1, 4, 4, 5, 9, 7, 13, 3
getRankOfNumber(1) = 0
getRankOfNumber(3) = 1
getRankOfNumber(4) = 3
The suggested solution uses a modified Binary Search Tree, where each node stores the number of nodes to its left. The time complexity of both methods is O(logN) for a balanced tree and O(N) for an unbalanced tree, where N is the number of nodes.
But how can we construct a balanced BST from a stream of random integers? Won't the tree become unbalanced over time if we keep adding to the same tree and the root is not the median? Shouldn't the worst-case complexity of this solution be O(N) (in which case a HashMap, with O(1) for track() and O(N) for getRankOfNumber(), would be better)?
You just need to build an AVL or red-black tree to get the O(lg n) complexities you desire.
As for the rank, it's kind of simple. Let's call count(T) the number of elements of the tree with root T.
The rank of a node N will be:
firstly, there will be count(N's left subtree) nodes before N (elements smaller than N)
let A = N's parent. If N is the right child of A, then there will be 1 + count(A's left subtree) more nodes before N
if A is the right child of some B, then there will be 1 + count(B's left subtree) more nodes before N
recursively, run all the way up until you reach the root or until the node you are at isn't someone's right child.
As the height of a balanced tree is at most lg(n), this method takes O(lg n) to return someone's rank (O(lg n) to find the node + O(lg n) to run back up and accumulate the rank), provided that all nodes store the sizes of their left and right subtrees.
Hope that helps :)
Building a Binary Search Tree (BST) from the stream of numbers is easy to picture: all values less than a node go to its left, and all values greater go to its right.
Then the rank of any x will be the number of nodes in the left subtree of the node with value x, plus the contributions from the ancestors that x's node lies to the right of (as the previous answer describes).
Operations to be done: find the node with value x, O(logN), plus count the nodes of the left subtree of the node found, O(logN). Total: O(logN + logN) = O(logN).
To optimize the left-subtree count from O(logN) to O(1), you can keep another field 'leftSubTreeSize' in the Node class and maintain it during insertion of a node.
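A minimal Python sketch of the leftSubTreeSize idea (hypothetical; plain BST insertion for brevity, so it is O(log N) only while the tree stays balanced — an AVL or red-black variant would guarantee it, as the first answer notes):

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.left_size = 0                  # number of nodes in the left subtree

class RankTracker:
    def __init__(self):
        self.root = None

    def track(self, x):
        if self.root is None:
            self.root = Node(x)
            return
        node = self.root
        while True:
            if x <= node.value:
                node.left_size += 1         # x will end up in the left subtree
                if node.left is None:
                    node.left = Node(x)
                    return
                node = node.left
            else:
                if node.right is None:
                    node.right = Node(x)
                    return
                node = node.right

    def get_rank_of_number(self, x):
        rank, node = 0, self.root
        while node is not None:
            if x == node.value:
                return rank + node.left_size
            if x < node.value:
                node = node.left
            else:
                rank += node.left_size + 1  # this node and its left subtree rank below x
                node = node.right
        return -1                           # x not seen (behaviour left open by the book)

t = RankTracker()
for v in [5, 1, 4, 4, 5, 9, 7, 13, 3]:
    t.track(v)
print(t.get_rank_of_number(1), t.get_rank_of_number(3), t.get_rank_of_number(4))  # 0 1 3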

Algorithm which finds biggest n nodes in a tree

Let's assume that we have a tree whose nodes hold some numbers.
I need to find the n biggest numbers in this tree.
I have two algorithms in mind:
1. Using BFS or DFS, iterate over the tree, put its nodes in an array, sort it (with quicksort, for example) and return the first n elements.
Time complexity of this method is O(|V| + |E| + |V|log|V|), space is O(|V|).
2. The second is to iterate over the tree n times, each time finding and marking the maximum element. So time complexity is O(n*(|V| + |E|)), space is O(|V|) too.
Which solution is better? Or maybe I'm on the wrong track and there is a much better solution?
And a standard heap selection algorithm won't work?
The basic algorithm is (assuming that k is the number of items you want to select):
create an empty min-heap
for each node (depth-first search)
    if heap.count < k
        heap.Add(node)
    else if node.Value > heap.Peek().Value()   // node beats the smallest of the k kept so far
        heap.RemoveSmallest()
        heap.Add(node)
When the for loop is done, your heap contains the k largest values. You can obtain them in ascending order with:
while heap.count > 0
    output(heap.RemoveSmallest().Value)
If you want them in descending order instead, remove them from the heap as above into an array, and then reverse the array.
This algorithm is O(n log k), where n is the number of nodes in the tree, and k is the number of items you want.
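The same algorithm in Python using heapq (a hypothetical sketch, assuming tree nodes with value/left/right attributes):

import heapq

def k_largest(root, k):
    heap = []                          # min-heap of the k largest values seen so far
    stack = [root]
    while stack:
        node = stack.pop()             # iterative depth-first traversal
        if node is None:
            continue
        if len(heap) < k:
            heapq.heappush(heap, node.value)
        elif node.value > heap[0]:     # beats the smallest of the k kept so far
            heapq.heapreplace(heap, node.value)
        stack.extend((node.left, node.right))
    return sorted(heap)                # ascending; reverse for descending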

Find algorithm in 2-3 BST tree

Is there an algorithm that, given a 2-3 tree T and a pointer to some node v in said tree, can change the key of v so that T remains a legal 2-3 tree, in O(log n / log log n) amortized time?
No.
Assume it were possible, with an algorithm f; we will show that we could then sort an array in O(n * log n / log log n) time.
To sort an array A of length n:
(1) Create a 2-3 tree of size n, with no importance to keys; let it be T.
(2) Store pointers to all nodes of T in a second array B.
(3) For each i from 0 to n-1:
    (3.1) f(B[i], A[i]) // modify the tree: pointer B[i], new value A[i]
(4) Extract the elements from T back into A, in order.
Correctness:
After each invocation of f the tree is legal. After invoking f on all nodes of T with all elements of A, the tree is legal and contains all the elements. Thus, extracting the elements from T in order gives back the sorted array.
Complexity:
(1) Creating the tree (it doesn't matter which keys we use; we can put 0 everywhere) is O(n).
(2) Traversing T to create B is O(n).
(3) Each invocation of f is O(log n / log log n), so invoking it n times is O(n * log n / log log n).
(4) Extracting the elements is just a traversal: O(n).
Thus, the total complexity is O(n * log n / log log n).
But sorting is an Ω(n log n) problem for comparison-based algorithms. Contradiction.
Conclusion: the desired f doesn't exist.
