How to build a binary tree in O(N) time?

Following on from a previous question here I'm keen to know how to build a binary tree from an array of N unsorted large integers in order N time?

Unless you have some preconditions on the list that allow you to calculate the position in the tree for each item in constant time, it is not possible to 'build', that is sequentially insert, items into a tree in O(N) time. Each insertion into a balanced tree takes up to log M comparisons, where M is the number of items already in the tree.

OK, just for completeness... The binary tree in question is built from an array and has a leaf for every array element. It keeps them in their original index order, not value order, so it doesn't magically let you sort a list in linear time. It also needs to be balanced.
To build such a tree in linear time, you can use a simple recursive algorithm like this (using 0-based indexes):
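// Build a tree over the elements [start, end) of array.
// Precondition: end > start.
// Node is assumed to be the common supertype of InternalNode and LeafNode.
Node buildTree(int[] array, int start, int end)
{
    if (end - start > 1)
    {
        // split the range in half and build each half recursively
        int mid = (start + end) >> 1;
        Node left = buildTree(array, start, mid);
        Node right = buildTree(array, mid, end);
        return new InternalNode(left, right);
    }
    else
    {
        // a single element becomes a leaf
        return new LeafNode(array[start]);
    }
}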

I agree that this seems impossible in general (assuming we have a general, totally ordered set S of N items.) Below is an informal argument where I essentially reduce the building of a BST on S to the problem of sorting S.
Informal argument. Let S be a set of N elements, and suppose you could construct a binary search tree T that stores the items from S in O(N) time.
Now do an inorder walk of the tree and print the values as you visit them. You have essentially sorted the elements of S. The walk takes O(|T|) steps, where |T| is the size of the tree (i.e. the number of nodes), and a tree built in O(N) time cannot have more than O(N) nodes.
So the total is O(N) = o(N log N) time, which means you have just solved the general sorting problem in o(N log N) time. That contradicts the Ω(N log N) lower bound for comparison-based sorting.

I have an idea for how this could be possible.
Sort the array with radix sort, which is O(N) for fixed-width integer keys. Thereafter, use a recursive procedure to place the elements, like:
node *insert(int *array, int size) {
    if(size <= 0)
        return NULL;
    node *rc = new node;
    int midpoint = size / 2;
    rc->data = array[midpoint];
    rc->left = insert(array, midpoint);
    rc->right = insert(array + midpoint + 1, size - midpoint - 1);
    return rc;
}
Since we never walk the tree from the top down but always attach each node directly at its final position, each node costs O(1), so building the tree from the sorted array is O(N) overall.

Related

Top and Bottom 5% elements from data stream

Given a digital input stream,
the average value of the last k values is required, and in the calculation
the top 5% and bottom 5% of the k numbers must be removed.
Can we do it in linear time? An O(n log k) solution would use priority queues, but I am not able to think of a more efficient solution.
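Here is an approach with O(log(k)) update cost per value. I used a deque to hold the last k values in input order and an ordered multiset to maintain a sorted version of the last k values (a multiset rather than a set, so that duplicate values are kept). Note that traversing the sorted values to compute the mean is still O(k) per output:
deque<T> d;
multiset<T> s; // e.g., red-black tree
for each new value x {
    d.push_back(x);
    s.insert(x);
    if (d.size() > k) {
        old = d.front(); d.pop_front();
        s.erase(s.find(old)); // remove one copy of the oldest value
    }
    // s now holds the last k values in sorted order
    // traverse s in order, pass over the first and last 0.05*k values,
    // and take the mean of the rest
}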
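The deque-plus-sorted-container idea above can be sketched in Python as follows. This is only an illustration under some assumptions: the function name `trimmed_means` is made up, 5% of k is rounded down, and a plain list kept sorted with `bisect` stands in for the balanced tree, so each update costs O(k) for shifting rather than O(log k).

```python
from bisect import bisect_left, insort
from collections import deque


def trimmed_means(stream, k, trim=0.05):
    """Yield the trimmed mean of the last k values once k values have been seen."""
    window = deque()      # last k values, oldest first
    ordered = []          # the same values, kept in ascending order
    cut = int(trim * k)   # how many values to drop at each end (rounded down)
    for x in stream:
        window.append(x)
        insort(ordered, x)
        if len(window) > k:
            old = window.popleft()
            del ordered[bisect_left(ordered, old)]  # remove one copy of old
        if len(window) == k:
            kept = ordered[cut:k - cut]             # skip both tails
            yield sum(kept) / len(kept)
```

For example, with k = 20 one value is dropped at each end, so a single outlier such as 1000 among nineteen 5s does not affect the result.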

What would be the complexity of this Sorting Algorithm? What are the demerits of using the same?

The sorting algorithm can be described as follows:
1. Create a Binary Search Tree from the Array data.
(For multiple occurrences, increment the occurence variable of the current Node.)
2. Traverse the BST in inorder fashion.
(Inorder traversal will return the sorted order of the elements in the array.)
3. At each node in the inorder traversal, overwrite the array element at the current index (index beginning at 0) with the current node value.
Here's a Java implementation for the same:
Structure of Node Class
class Node {
    Node left;
    int data;
    int occurence;
    Node right;
}
inorder function
(returning type is int just for obtaining correct indices at every call, they serve no other purpose)
public int inorder(Node root, int[] arr, int index) {
    if (root == null) return index;
    index = inorder(root.left, arr, index);
    for (int i = 0; i < root.getOccurence(); i++)
        arr[index++] = root.getData();
    index = inorder(root.right, arr, index);
    return index;
}
main()
public static void main(String[] args) {
    int[] arr = new int[]{100,100,1,1,1,7,98,47,13,56};
    BinarySearchTree bst = new BinarySearchTree(new Node(arr[0]));
    for (int i = 1; i < arr.length; i++)
        bst.insert(bst.getRoot(), arr[i]);
    int dummy = bst.inorder(bst.getRoot(), arr, 0);
    System.out.println(Arrays.toString(arr));
}
The space complexity is terrible, I know, but it should not be such a big issue unless the sort is used for an extremely HUGE dataset. However, as I see it, isn't Time Complexity O(n)? (Insertions and Retrieval from BST is O(log n), and each element is touched once, making it O(n)). Correct me if I am wrong as I haven't yet studied Big-O well.
Assuming that the average complexity of an insertion is O(log n), N inserts (construction of the tree) give O(log(1) + log(2) + ... + log(N-1) + log(N)) = O(log(N!)) = O(N log N) (by Stirling's approximation). To read back the sorted array, perform an in-order depth-first traversal, which visits each node once and is hence O(N). Combining the two you get O(N log N).
However this requires that the tree is always balanced! This will not be the case in general for the most basic binary tree, as insertions do not check the relative depths of each child tree. There are many variants which are self-balancing - the two most famous being Red-Black trees and AVL trees. However the implementation of balancing is quite complicated and often leads to a higher constant factor in real-life performance.
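The step from the sum of logarithms to N log N can also be checked without Stirling's formula, using only elementary bounds on the sum (a short derivation, with logarithms to any fixed base):

```latex
\[
\log(N!) = \sum_{k=1}^{N} \log k \le N \log N,
\qquad
\log(N!) \ge \sum_{k=\lceil N/2 \rceil}^{N} \log k \ge \frac{N}{2}\,\log\frac{N}{2},
\]
\[
\text{so } \log(N!) = \Theta(N \log N).
\]
```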
the goal was to implement an O(n) algorithm to sort an Array of n elements with each element in the range [1, n^2]
In that case radix sort (counting variation) would be O(n), taking a fixed number of passes, log_b(n^2), where b is the "base" used for the field and is a function of n: with b == n it takes two passes, with b == sqrt(n) it takes four passes, and if n is small enough, b == n^2 takes a single pass, so plain counting sort can be used. b could be rounded up to the next power of 2 in order to replace division and modulo with a binary shift and a binary AND. Radix sort needs O(n) extra space, but so do the links for a binary tree.
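As a minimal sketch of the b == n case described above: the values are shifted to [0, n^2 - 1] so that each one has exactly two base-n digits, and one stable counting pass is made per digit. The function name `radix_sort_n_squared` is made up for this illustration, and the input is assumed to respect the stated range.

```python
def radix_sort_n_squared(values):
    """Sort n integers drawn from [1, n*n] with two counting passes in base n."""
    n = len(values)
    if n < 2:
        return list(values)
    base = n
    shifted = [v - 1 for v in values]   # map [1, n^2] to [0, n^2 - 1]
    for digit in (0, 1):                # least significant digit first
        divisor = base ** digit
        counts = [0] * base
        for v in shifted:
            counts[(v // divisor) % base] += 1
        # prefix sums turn the counts into starting positions
        position = 0
        for d in range(base):
            counts[d], position = position, position + counts[d]
        output = [0] * n
        for v in shifted:               # stable placement keeps the earlier order
            d = (v // divisor) % base
            output[counts[d]] = v
            counts[d] += 1
        shifted = output
    return [v + 1 for v in shifted]
```

Each pass touches every element a constant number of times and uses O(n) extra space for the counts and the output buffer.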

Time Complexity for Finding the Minimum Value of a Binary Tree

I wrote a recursive function for finding the min value of a binary tree (assume that it is not ordered).
The code is as below.
// Assume node values are positive ints, so 0 can mark an empty subtree.
int minValue(Node n) {
    if (n == null) return 0;
    int leftmin = minValue(n.left);
    int rightmin = minValue(n.right);
    return min(n.data, leftmin, rightmin);
}
// Minimum of a and whichever of b and c are non-zero (0 means "no value").
int min(int a, int b, int c) {
    int min = a;
    if (b != 0 && b < min) min = b;
    if (c != 0 && c < min) min = c;
    return min;
}
I guess the time complexity of the minValue function is O(n) by intuition.
Is this correct? Can someone show the formal proof of the time complexity of minValue function?
Assuming your binary tree is not ordered, your algorithm will have O(N) running time, so your intuition is correct. The reason is that the minimum could be in any node, so every node has to be visited, and your function does a constant amount of work per node. But this assumes that the tree is completely unordered.
For a sorted and balanced binary tree, searching will take O(logN). The reason for this is that the search will only ever have to traverse one single path down the tree. A balanced tree with N nodes will have a height of log(N), and this explains the complexity for searching. Consider the following tree for example:
      5
    /   \
   3     7
  / \   / \
 1   4 6   8
There are 7 nodes in the tree, but the height is only floor(log2(7)) = 2. You can convince yourself that you only ever have to follow a single path from the root downwards to find a value or fail doing so.
Note that for a binary tree which is not balanced these complexities may not apply.
The number of comparisons is n-1. The proof is an old chestnut, usually applied to the problem of saying how many matches are needed in a single-elimination tennis tournament. Each comparison removes exactly one number from consideration, so if there are initially n numbers in the tree, you need n-1 comparisons to reduce that to 1.
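The n-1 count can be checked with a small sketch that returns the number of comparisons alongside the minimum. The `Node` class and the `min_value` name are assumptions made for this illustration; unlike the code in the question, it requires a non-empty tree instead of using 0 as a sentinel.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right


def min_value(node):
    """Return (minimum, comparisons) for a non-empty, unordered binary tree."""
    best, comparisons = node.data, 0
    for child in (node.left, node.right):
        if child is not None:
            child_min, child_cmp = min_value(child)
            comparisons += child_cmp + 1   # one comparison merges the child's result
            if child_min < best:
                best = child_min
    return best, comparisons
```

One comparison is made per edge of the tree, and a tree with n nodes has n-1 edges, which is why the running time is linear.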
You can lookup and remove the min/max of a BST in constant time O(1), if you implement it yourself and store a reference to head/tail. Most implementations don't do that, only storing the root-node. But if you analyze how a BST works, given a ref to min/max (or aliased as head/tail), then you can find the next min/max in constant time.
See this for more info:
https://stackoverflow.com/a/74905762/1223975

Generating uniformly random curious binary trees

A binary tree of N nodes is 'curious' if it is a binary tree whose node values are 1, 2, ..,N and which satisfy the property that
Each internal node of the tree has exactly one descendant which is greater than it.
Every number in 1,2, ..., N appears in the tree exactly once.
Example of a curious binary tree
    4
   / \
  5   2
     / \
    1   3
Can you give an algorithm to generate a uniformly random curious binary tree of n nodes, which runs in O(n) guaranteed time?
Assume you only have access to a random number generator which can give you a (uniformly distributed) random number in the range [1, k] for any 1 <= k <= n. Assume the generator runs in O(1).
I would like to see an O(nlogn) time solution too.
Please follow the usual definition of labelled binary trees being distinct, to consider distinct curious binary trees.
There is a bijection between "curious" binary trees and standard heaps. Namely, given a heap, recursively (starting from the top) swap each internal node with its largest child. And, as I learned in StackOverflow not long ago, a heap is equivalent to a permutation of 1,2,...,N. So you should make a random permutation and turn it into a heap; or recursively make the heap in the same way that you would have made a random permutation. After that you can convert the heap to a "curious tree".
Aha, I think I've got how to create a random heap in O(N) time. (after which, use approach in Greg Kuperberg's answer to transform into "curious" binary tree.)
edit 2: Rough pseudocode for making a random min-heap directly. Max-heap is identical except the values inserted into the heap are in reverse numerical order.
struct Node {
    Node left, right;
    Object key;
    constructor newNode() {
        N = new Node;
        N.left = N.right = null;
        N.key = null;
        return N;
    }
}
function create-random-heap(RandomNumberGenerator rng, int N)
{
    Node heap = Node.newNode();
    // Creates a heap with an "incomplete" node containing a null, and having
    // both child nodes as null.
    List incompleteHeapNodes = [heap];
    // use a vector/array type list to keep track of incomplete heap nodes.
    for k = 1:N
    {
        // loop invariant: incompleteHeapNodes has k members. Order is unimportant.
        int m = rng.getRandomNumber(k);
        // create a random number between 0 and k-1
        Node node = incompleteHeapNodes.get(m);
        // pick a random node from the incomplete list,
        // make it a complete node with key k.
        // It is ok to do so since all of its parent nodes
        // have values less than k.
        node.left = Node.newNode();
        node.right = Node.newNode();
        node.key = k;
        // Now remove this node from incompleteHeapNodes
        // and add its children. (replace node with node.left,
        // append node.right)
        incompleteHeapNodes.set(m, node.left);
        incompleteHeapNodes.append(node.right);
        // All operations in this loop take O(1) time.
    }
    return prune-null-nodes(heap);
}
// get rid of all the incomplete nodes.
function prune-null-nodes(heap)
{
    if (heap == null || heap.key == null)
        return null;
    heap.left = prune-null-nodes(heap.left);
    heap.right = prune-null-nodes(heap.right);
    return heap;
}
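For readers who want to run it, the pseudocode above translates fairly directly into Python. This is a sketch that mirrors the pseudocode rather than a vetted implementation: the names `create_random_heap` and `prune` are chosen here, and `rng` is assumed to be anything with a `randrange` method (the `random` module by default).

```python
import random


class Node:
    def __init__(self):
        self.key = None      # None marks an incomplete (placeholder) node
        self.left = None
        self.right = None


def create_random_heap(n, rng=random):
    """Build a heap-ordered binary tree with keys 1..n, following the pseudocode."""
    root = Node()
    incomplete = [root]                        # placeholders that can take the next key
    for k in range(1, n + 1):
        m = rng.randrange(len(incomplete))     # len(incomplete) == k at this point
        node = incomplete[m]
        node.key = k                           # every ancestor already holds a smaller key
        node.left, node.right = Node(), Node()
        incomplete[m] = node.left              # replace the filled slot ...
        incomplete.append(node.right)          # ... and append the second placeholder
    return prune(root)


def prune(node):
    """Drop placeholder nodes and return the cleaned subtree."""
    if node is None or node.key is None:
        return None
    node.left = prune(node.left)
    node.right = prune(node.right)
    return node
```

Because `prune` is recursive, a very deep tree may hit Python's recursion limit; an explicit stack would avoid that for large n.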

Find median value from a growing set

I came across an interesting algorithm question in an interview. I gave my answer but not sure whether there is any better idea. So I welcome everyone to write something about his/her ideas.
You have an empty set. Now elements are put into the set one by one. We assume all the elements are integers and they are distinct (according to the definition of set, we don't consider two elements with the same value).
Every time a new element is added to the set, the set's median value is asked. The median value is defined the same as in math: the middle element in a sorted list. Here, specially, when the size of set is even, assuming size of set = 2*x, the median element is the x-th element of the set.
An example:
Start with an empty set,
when 12 is added, the median is 12,
when 7 is added, the median is 7,
when 8 is added, the median is 8,
when 11 is added, the median is 8,
when 5 is added, the median is 8,
when 16 is added, the median is 8,
...
Notice that, first, elements are added to set one by one and second, we don't know the elements going to be added.
My answer.
Since it is a question about finding the median, sorting is needed. The easiest solution is to use a normal array and keep the array sorted. When a new element comes, use binary search to find the position for the element (log n) and add the element to the array. Since it is a normal array, shifting the rest of the array is needed, whose time complexity is n. Once the element is inserted, we can immediately get the median, in constant time.
The WORST time complexity is: log n + n + 1.
Another solution is to use a linked list. The reason for using a linked list is to remove the need for shifting the array. But finding the location of the new element requires a linear search. Adding the element takes constant time, and then we need to find the median by going through half of the list, which always takes n/2 time.
The WORST time complexity is: n + 1 + n/2.
The third solution is to use a binary search tree. Using a tree, we avoid shifting array. But using the binary search tree to find the median is not very attractive. So I change the binary search tree in a way that it is always the case that the left subtree and the right subtree are balanced. This means that at any time, either the left subtree and the right subtree have the same number of nodes or the right subtree has one node more than in the left subtree. In other words, it is ensured that at any time, the root element is the median. Of course this requires changes in the way the tree is built. The technical detail is similar to rotating a red-black tree.
If the tree is maintained properly, it is ensured that the WORST time complexity is O(n).
So the three algorithms are all linear in the size of the set. If no sub-linear algorithm exists, the three algorithms can be thought of as optimal solutions. Since they don't differ from each other much, the best is the easiest to implement, which is the second one, using a linked list.
So what I really wonder is, will there be a sub-linear algorithm for this problem and if so what will it be like. Any ideas guys?
Steve.
Your complexity analysis is confusing. Let's say that n items total are added; we want to output the stream of n medians (where the ith in the stream is the median of the first i items) efficiently.
I believe this can be done in O(n*lg n) time using two priority queues (e.g. binary or Fibonacci heaps): one queue for the items below the current median (so the largest element is at the top), and the other for the items above it (in this heap, the smallest is at the top). Note that in Fibonacci (and some other) heaps, insertion is O(1) amortized; it's only popping an element that's O(lg n).
This would be called an "online median selection" algorithm, although Wikipedia only talks about online min/max selection. Here's an approximate algorithm, and a lower bound on deterministic and approximate online median selection (a lower bound means no faster algorithm is possible!)
If there are a small number of possible values compared to n, you can probably break the comparison-based lower bound just like you can for sorting.
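The two-heap approach described above can be sketched with Python's `heapq` module. The class name `RunningMedian` is made up for this illustration; it uses binary heaps (so each insert is O(log n)) and follows the question's convention that the median of 2*x elements is the x-th smallest.

```python
import heapq


class RunningMedian:
    """Tracks the median of a growing collection using two heaps."""

    def __init__(self):
        self.low = []    # max-heap (values stored negated): lower half, median on top
        self.high = []   # min-heap: upper half

    def add(self, x):
        """Insert x and return the current median (lower middle for even sizes)."""
        heapq.heappush(self.low, -x)
        # move the largest of the lower half up, then rebalance if needed
        heapq.heappush(self.high, -heapq.heappop(self.low))
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))
        return -self.low[0]
```

Feeding it the question's example sequence 12, 7, 8, 11, 5, 16 gives the medians 12, 7, 8, 8, 8, 8.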
I received the same interview question and came up with the two-heap solution in wrang-wrang's post. As he says, the time per operation is O(log n) worst-case. The expected time is also O(log n) because you have to "pop an element" 1/4 of the time assuming random inputs.
I subsequently thought about it further and figured out how to get constant expected time; indeed, the expected number of comparisons per element becomes 2+o(1). You can see my writeup at http://denenberg.com/omf.pdf .
BTW, the solutions discussed here all require space O(n), since you must save all the elements. A completely different approach, requiring only O(log n) space, gives you an approximation to the median (not the exact median). Sorry I can't post a link (I'm limited to one link per post) but my paper has pointers.
Although wrang-wrang already answered, I wish to describe a modification of your binary search tree method that is sub-linear.
We use a binary search tree that is balanced (AVL/Red-Black/etc), but not super-balanced like you described. So adding an item is O(log n)
One modification to the tree: for every node we also store the number of nodes in its subtree. This doesn't change the complexity. (For a leaf this count would be 1, for a node with two leaf children this would be 3, etc)
We can now access the Kth smallest element in O(log n) using these counts:
def get_kth_item(subtree, k):
    left_size = 0 if subtree.left is None else subtree.left.size
    if k < left_size:
        return get_kth_item(subtree.left, k)
    elif k == left_size:
        return subtree.value
    else: # k > left_size
        return get_kth_item(subtree.right, k-1-left_size)
A median is a special case of Kth smallest element (given that you know the size of the set).
So all in all this is another O(log n) solution.
We can define a min heap and a max heap to store the numbers. Additionally, we define a class DynamicArray for the number set, with two functions: Insert and GetMedian. Time to insert a new number is O(lg n), while time to get the median is O(1).
This solution is implemented in C++ as the following:
#include <algorithm>
#include <functional>
#include <stdexcept>
#include <vector>
using namespace std;

template<typename T> class DynamicArray
{
public:
    void Insert(T num)
    {
        if(((minHeap.size() + maxHeap.size()) & 1) == 0)
        {
            // even count so far: the new element must end up in minHeap
            if(maxHeap.size() > 0 && num < maxHeap[0])
            {
                // num belongs to the lower half: swap it with the top of maxHeap
                maxHeap.push_back(num);
                push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                num = maxHeap[0];
                pop_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                maxHeap.pop_back();
            }
            minHeap.push_back(num);
            push_heap(minHeap.begin(), minHeap.end(), greater<T>());
        }
        else
        {
            // odd count so far: the new element must end up in maxHeap
            if(minHeap.size() > 0 && minHeap[0] < num)
            {
                // num belongs to the upper half: swap it with the top of minHeap
                minHeap.push_back(num);
                push_heap(minHeap.begin(), minHeap.end(), greater<T>());
                num = minHeap[0];
                pop_heap(minHeap.begin(), minHeap.end(), greater<T>());
                minHeap.pop_back();
            }
            maxHeap.push_back(num);
            push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
        }
    }
    T GetMedian()
    {
        size_t size = minHeap.size() + maxHeap.size();
        if(size == 0)
            throw runtime_error("No numbers are available");
        if((size & 1) == 1)
            return minHeap[0];
        return (minHeap[0] + maxHeap[0]) / 2;
    }
private:
    vector<T> minHeap; // upper half, smallest element at index 0
    vector<T> maxHeap; // lower half, largest element at index 0
};
For more detailed analysis, please refer to my blog: http://codercareer.blogspot.com/2012/01/no-30-median-in-stream.html.
1) As with the previous suggestions, keep two heaps and cache their respective sizes. The left heap keeps values below the median, the right heap keeps values above the median. If you simply negate the values in the right heap the smallest value will be at the root so there is no need to create a special data structure.
2) When you add a new number, you determine the new median from the size of your two heaps, the current median, and the two roots of the L&R heaps, which just takes constant time.
3) Call a private threaded method to do the actual insert and update, but return immediately with the new median value. You only need to block until the heap roots are updated. Then, the thread doing the insert just needs to maintain a lock on the grandparent node as it traverses the tree; this will ensure that you can insert and rebalance without blocking other inserting threads working on other sub-branches.
Getting the median becomes a constant time procedure, of course now you may have to wait on synchronization from further adds.
Rob
A balanced tree (e.g. R/B tree) with augmented size field should find the median in lg(n) time in the worst case. I think it is in Chapter 14 of the classic Algorithm text book.
To keep the explanation brief, you can efficiently augment a BST to select a key of a specified rank in O(h) by having each node store the number of nodes in its left subtree. If you can guarantee that the tree is balanced, you can reduce this to O(log(n)). Consider using an AVL which is height-balanced (or red-black tree which is roughly balanced), then you can select any key in O(log(n)). When you insert or delete a node into the AVL you can increment or decrement a variable that keeps track of the total number of nodes in the tree to determine the rank of the median which you can then select in O(log(n)).
In order to find the median in linear time you can try this (it just came to my mind). You need to store some values every time you add number to your set, and you won't need sorting. Here it goes.
typedef struct
{
    int number;
    int lesser;   /* how many stored values are smaller than number */
    int greater;  /* how many stored values are larger than number */
} record;

/* Adds n to numbers[0..count-1] (the array must have room for one more
   record) and returns the median of the count + 1 stored values. */
int median(record numbers[], int count, int n)
{
    int i;
    int a = 0, b = 0;

    numbers[count].number = n;
    numbers[count].lesser = 0;
    numbers[count].greater = 0;
    for (i = 0; i < count; i++)
    {
        if (n < numbers[i].number)
        {
            numbers[i].lesser++;
            numbers[count].greater++;
        }
        else
        {
            numbers[i].greater++;
            numbers[count].lesser++;
        }
    }
    for (i = 0; i < count + 1; i++)
    {
        if (numbers[i].greater == numbers[i].lesser)
            return numbers[i].number;  /* odd size: the exact middle */
        if (numbers[i].greater - numbers[i].lesser == -1)
            a = numbers[i].number;
        if (numbers[i].greater - numbers[i].lesser == 1)
            b = numbers[i].number;
    }
    return (a + b) / 2;  /* even size: mean of the two middle values */
}
What this does is, each time you add a number to the set, it records for every number how many stored numbers are lesser than it and how many are greater than it. So, if a number has the same "lesser than" and "greater than" counts, it is in the very middle of the set, without any sorting. If you have an even amount of numbers there are two candidates for the median, so you just return the mean of those two. BTW, this is C code, I hope this helps.
