Find median value from a growing set - algorithm

I came across an interesting algorithm question in an interview. I gave my answer, but I'm not sure whether there is a better idea, so I welcome everyone to share their thoughts.
You have an empty set. Now elements are put into the set one by one. We assume all the elements are integers and they are distinct (according to the definition of set, we don't consider two elements with the same value).
Every time a new element is added to the set, the set's median value is asked. The median value is defined as in math: the middle element of a sorted list. As a special case here, when the size of the set is even, say size = 2*x, the median is the x-th smallest element of the set.
An example:
Start with an empty set,
when 12 is added, the median is 12,
when 7 is added, the median is 7,
when 8 is added, the median is 8,
when 11 is added, the median is 8,
when 5 is added, the median is 8,
when 16 is added, the median is 8,
...
Notice that, first, elements are added to the set one by one, and second, we don't know the elements that are going to be added.
My answer.
Since the question is about finding a median, some ordering is needed. The easiest solution is to keep a sorted array. When a new element comes, use binary search to find its position (O(log n)) and insert it there. Since it is a plain array, the rest of the array must be shifted, which takes O(n) time. Once the element is inserted, we can read the median immediately, in constant time.
The WORST time complexity per insertion is: O(log n + n + 1) = O(n).
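(For concreteness, here is a quick Python sketch of this sorted-array approach, using the standard bisect module; it reproduces the medians of the example above.)

import bisect

sorted_set = []

def add_and_get_median(x):
    bisect.insort(sorted_set, x)                   # O(log n) search + O(n) shift
    return sorted_set[(len(sorted_set) - 1) // 2]  # the x-th element when size is 2*x

for v in [12, 7, 8, 11, 5, 16]:
    print(add_and_get_median(v))                   # 12, 7, 8, 8, 8, 8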
Another solution is to use a linked list. The reason for using a linked list is to remove the need to shift elements. But finding the location for the new element requires a linear search. Inserting the element then takes constant time, and finding the median means walking through half of the list, which always takes n/2 steps.
The WORST time complexity per insertion is: O(n + 1 + n/2) = O(n).
The third solution is to use a binary search tree. With a tree, we avoid shifting the array. But using a plain binary search tree to find the median is not very attractive, so I modify the tree so that the left and right subtrees always stay balanced in size: at any time, either both subtrees have the same number of nodes, or the right subtree has exactly one node more than the left. In other words, it is ensured that at any time the root element is the median. Of course this requires changes in the way the tree is built; the technical details are similar to the rotations of a red-black tree.
If the tree is maintained properly, the WORST time complexity per insertion is O(n).
So all three algorithms are linear in the size of the set. If no sub-linear algorithm exists, the three can be thought of as optimal. Since they don't differ much from each other, the best is the easiest to implement, which is the second one, using a linked list.
So what I really wonder is: will there be a sub-linear algorithm for this problem, and if so, what will it look like? Any ideas, guys?
Steve.

Your complexity analysis is confusing. Let's say that n items total are added; we want to output the stream of n medians (where the i-th median in the stream is the median of the first i items) efficiently.
I believe this can be done in O(n log n) time using two priority queues (e.g. binary or Fibonacci heaps): one queue for the items below the current median (so the largest element is at the top), and the other for items above it (in this heap, the smallest is at the top). Note that in Fibonacci (and some other) heaps, insertion is O(1) amortized; it's only popping an element that's O(log n).
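Here is a rough Python sketch of that two-heap scheme using heapq (class and method names are mine, just for illustration; the max-heap is simulated by negating values):

import heapq

class OnlineMedian:
    def __init__(self):
        self.lower = []   # max-heap of the lower half (values stored negated)
        self.upper = []   # min-heap of the upper half

    def add(self, x):
        if self.lower and x < -self.lower[0]:
            heapq.heappush(self.lower, -x)
        else:
            heapq.heappush(self.upper, x)
        # rebalance so that len(lower) is len(upper) or len(upper) + 1
        if len(self.lower) > len(self.upper) + 1:
            heapq.heappush(self.upper, -heapq.heappop(self.lower))
        elif len(self.upper) > len(self.lower):
            heapq.heappush(self.lower, -heapq.heappop(self.upper))

    def median(self):
        return -self.lower[0]   # the x-th element when the size is 2*x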
This would be called an "online median selection" algorithm, although Wikipedia only talks about online min/max selection. Here's an approximate algorithm, and a lower bound on deterministic and approximate online median selection (a lower bound means no faster algorithm is possible!)
If there are a small number of possible values compared to n, you can probably break the comparison-based lower bound just like you can for sorting.

I received the same interview question and came up with the two-heap solution in wrang-wrang's post. As he says, the time per operation is O(log n) worst-case. The expected time is also O(log n) because you have to "pop an element" 1/4 of the time assuming random inputs.
I subsequently thought about it further and figured out how to get constant expected time; indeed, the expected number of comparisons per element becomes 2+o(1). You can see my writeup at http://denenberg.com/omf.pdf.
BTW, the solutions discussed here all require space O(n), since you must save all the elements. A completely different approach, requiring only O(log n) space, gives you an approximation to the median (not the exact median). Sorry I can't post a link (I'm limited to one link per post) but my paper has pointers.

Although wrang-wrang already answered, I wish to describe a modification of your binary search tree method that is sub-linear.
We use a binary search tree that is balanced (AVL/red-black/etc.), but not super-balanced like you described. So adding an item is O(log n).
One modification to the tree: for every node we also store the number of nodes in its subtree. This doesn't change the complexity. (For a leaf this count would be 1, for a node with two leaf children this would be 3, etc)
We can now access the Kth smallest element in O(log n) using these counts:
def get_kth_item(subtree, k):
    left_size = 0 if subtree.left is None else subtree.left.size
    if k < left_size:
        return get_kth_item(subtree.left, k)
    elif k == left_size:
        return subtree.value
    else:  # k > left_size
        return get_kth_item(subtree.right, k - 1 - left_size)
A median is a special case of Kth smallest element (given that you know the size of the set).
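For example, if you track the set's size separately, the median query is just this (a hypothetical helper on top of get_kth_item above):

def get_median(root, size):
    # 0-based rank: the x-th element of a 2*x-element set, per the question
    return get_kth_item(root, (size - 1) // 2)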
So all in all this is another O(log n) solution.

We can define a min-heap and a max-heap to store the numbers. Additionally, we define a class DynamicArray for the number set, with two functions: Insert and GetMedian. The time to insert a new number is O(log n), while the time to get the median is O(1).
This solution is implemented in C++ as follows:
#include <vector>
#include <algorithm>
#include <functional>
#include <stdexcept>
using namespace std;

template<typename T> class DynamicArray
{
public:
    void Insert(T num)
    {
        if(((minHeap.size() + maxHeap.size()) & 1) == 0)
        {
            if(maxHeap.size() > 0 && num < maxHeap[0])
            {
                maxHeap.push_back(num);
                push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                num = maxHeap[0];
                pop_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                maxHeap.pop_back();
            }
            minHeap.push_back(num);
            push_heap(minHeap.begin(), minHeap.end(), greater<T>());
        }
        else
        {
            if(minHeap.size() > 0 && minHeap[0] < num)
            {
                minHeap.push_back(num);
                push_heap(minHeap.begin(), minHeap.end(), greater<T>());
                num = minHeap[0];
                pop_heap(minHeap.begin(), minHeap.end(), greater<T>());
                minHeap.pop_back();
            }
            maxHeap.push_back(num);
            push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
        }
    }

    T GetMedian()
    {
        int size = minHeap.size() + maxHeap.size();
        if(size == 0)
            throw runtime_error("No numbers are available");

        T median = 0;
        if((size & 1) == 1)
            median = minHeap[0];
        else
            median = (minHeap[0] + maxHeap[0]) / 2;

        return median;
    }

private:
    vector<T> minHeap; // min-heap holding the upper half of the numbers
    vector<T> maxHeap; // max-heap holding the lower half of the numbers
};
For more detailed analysis, please refer to my blog: http://codercareer.blogspot.com/2012/01/no-30-median-in-stream.html.

1) As with the previous suggestions, keep two heaps and cache their respective sizes. The left heap keeps values below the median, the right heap keeps values above the median. If you simply negate the values in the right heap, the smallest value will be at the root, so there is no need to create a special data structure.
2) When you add a new number, you determine the new median from the size of your two heaps, the current median, and the two roots of the L&R heaps, which just takes constant time.
3) Call a private threaded method to perform the actual insert and rebalancing work, but return immediately with the new median value. You only need to block until the heap roots are updated. Then the inserting thread just needs to hold a lock on the grandparent node it is currently traversing past; this will ensure that you can insert and rebalance without blocking other inserting threads working on other sub-branches.
Getting the median becomes a constant-time procedure; of course, you may now have to wait on synchronization from further adds.
Rob

A balanced tree (e.g. a red-black tree) with an augmented size field can find the median in O(log n) time in the worst case. I believe this is in Chapter 14 of the classic algorithms textbook, CLRS.

To keep the explanation brief: you can efficiently augment a BST to select a key of a specified rank in O(h) by having each node store the number of nodes in its left subtree. If you can guarantee that the tree is balanced, you can reduce this to O(log n). Consider using an AVL tree, which is height-balanced (or a red-black tree, which is roughly balanced); then you can select any key in O(log n). When you insert or delete a node, you can increment or decrement a variable that keeps track of the total number of nodes in the tree to determine the rank of the median, which you can then select in O(log n).

In order to find the median in linear time per insertion, you can try this (it just came to my mind). You need to update some counts every time you add a number to your set, and you won't need sorting. Here it goes.
#include <limits.h>

#define VERY_BIG_NUMBER INT_MAX

typedef struct
{
    int number;
    int lesser;
    int greater;
} record;

/* numbers[0..count-1] hold the previously added elements;
   the new element n is stored at index count. */
int median(record numbers[], int count, int n)
{
    int i;
    int m = VERY_BIG_NUMBER;
    int a = 0, b = 0;
    numbers[count].number = n;
    numbers[count].lesser = 0;
    numbers[count].greater = 0;
    for (i = 0; i < count; i++)
    {
        if (n < numbers[i].number)
        {
            numbers[i].lesser++;
            numbers[count].greater++;
        }
        else
        {
            numbers[i].greater++;
            numbers[count].lesser++;
        }
    }
    for (i = 0; i <= count; i++)
    {
        if (numbers[i].greater - numbers[i].lesser == 0)
            m = numbers[i].number;
    }
    if (m == VERY_BIG_NUMBER)
    {
        for (i = 0; i <= count; i++)
        {
            if (numbers[i].greater - numbers[i].lesser == -1)
                a = numbers[i].number;
            if (numbers[i].greater - numbers[i].lesser == 1)
                b = numbers[i].number;
        }
        m = (a + b) / 2;
    }
    return m;
}
What this does is: each time you add a number to the set, you must know how many numbers in the set are less than it and how many are greater. So if a number has the same "lesser" and "greater" counts, it is in the very middle of the set, without any sorting. In the case that you have an even amount of numbers, you have two candidates for the median, so you just return the mean of those two. BTW, this is C code; I hope this helps.

Related

Smallest missing number at any point in time in a stream of positive numbers

We are processing a stream of positive integers. At any point in time, we can be asked a query to which the answer is the smallest positive number that we have not seen yet.
One can assume two APIs.
void processNext(int val)
int getSmallestNotSeen()
We can assume the numbers to be bounded by the range [1, 10^6]. Let N denote this bound.
Here is my solution.
Let's take an array of size 10^6. Whenever processNext(val) is called, we mark array[val] as 1. We build a sum segment tree on this array; the marking is a point update in the segment tree. Whenever getSmallestNotSeen() is called, I find the smallest index j such that sum[1..j] is less than j, using binary search. So processNext(val) is O(log N) (for the point update) and getSmallestNotSeen() is O((log N)^2).
I was thinking maybe there is something more optimal, or the above solution can be improved.
Make a map of id -> node (nodes of a doubly-linked list) and initialize nodes for the 10^6 values, each pointing to its neighbors. Initialize the min to one.
processNext(val): check if the node exists. If it does, delete it and point its neighbors at each other. If the node you delete has no left neighbor (i.e. it was the smallest), update the min to be its right neighbor.
getSmallestNotSeen(): return the min
The preprocessing is linear time and linear memory. Everything after that is constant time.
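A rough Python sketch of this idea (the names and the dict-based node representation are mine; prev/nxt play the role of the doubly-linked list):

N = 10**6

class SmallestNotSeen:
    def __init__(self):
        # prev[v] / nxt[v] give the nearest smaller / larger values not yet seen
        self.prev = {v: v - 1 for v in range(1, N + 1)}
        self.nxt = {v: v + 1 for v in range(1, N + 1)}
        self.min = 1

    def process_next(self, val):
        if val not in self.nxt:      # already seen
            return
        p, n = self.prev.pop(val), self.nxt.pop(val)
        if n in self.prev:
            self.prev[n] = p
        if p in self.nxt:
            self.nxt[p] = n
        if val == self.min:          # deleted the smallest unseen value
            self.min = n

    def get_smallest_not_seen(self):
        return self.min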
In case the number of processNext calls (i.e. the length of the stream) is fairly small compared with the range N, space usage could be limited by storing consecutive ranges of numbers instead of all possible individual numbers. This is also interesting when N could be a much larger range, like [1, 2^64 - 1].
Data structure
I would suggest a binary search tree with such [start, end] ranges as elements, and self-balancing (like AVL, red-black, ...).
Algorithm
Initialise the tree with one (root) node: [1, Infinity]
Whenever a new value val is pulled with processNext, find the range [start, end] that includes val, using binary search.
If the range has size 1 (and thus only contains val), perform a deletion of that node (according to the tree rules)
Else if val is a bounding value of the range, then just update the range in that node, excluding val.
Otherwise split the range into two. Update the node with one of the two ranges (decide by the balance information) and let the other range sift down to a new leaf (and rebalance if needed).
In the tree, maintain a reference to the node having the least start value. Only when this node gets deleted during processNext will a traversal up or down the tree be needed to find the next (in-order) node. When the node splits (see above) and it is decided to put the lower part in a new leaf, the reference needs to be updated to that leaf.
The getSmallestNotSeen function will return the start-value from that least-range node.
Time & Space Complexity
The space complexity is O(S), where S is the length of the stream
The time complexity of processNext is O(log(S))
The time complexity of getSmallestNotSeen is O(1)
The best case space and time complexity is O(1). Such a best case occurs when the stream has consecutive integers (increasing or decreasing)
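To make the bookkeeping concrete, here is a small Python sketch of the range idea. It uses a plain sorted list with bisect instead of a self-balancing tree, so updates here are O(S) rather than O(log S), but the splitting logic is the same:

import bisect

INF = float('inf')
ranges = [[1, INF]]   # sorted, disjoint [start, end] ranges of unseen values

def process_next(val):
    i = bisect.bisect_right(ranges, [val, INF]) - 1
    if i < 0 or not (ranges[i][0] <= val <= ranges[i][1]):
        return                                   # val was seen before
    start, end = ranges[i]
    if start == end:
        ranges.pop(i)                            # range vanishes
    elif val == start:
        ranges[i][0] = val + 1
    elif val == end:
        ranges[i][1] = val - 1
    else:                                        # split the range in two
        ranges[i][1] = val - 1
        ranges.insert(i + 1, [val + 1, end])

def get_smallest_not_seen():
    return ranges[0][0]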
bool array[10^6] = {false, false, ... }
int min = 1

void processNext(int val) {
    array[val] = true    // A
    while (array[min])   // B
        min++            // C
}

int getSmallestNotSeen() {
    return min
}
Time complexity:
processNext: amortised O(1)
getSmallestNotSeen: O(1)
Analysis:
If processNext is invoked k times and n is the highest value stored in min (which could be returned in getSmallestNotSeen), then:
the line A will be executed exactly k times,
the line B will be executed exactly k + n times, and
the line C will be executed exactly n times.
Additionally, n will never be greater than k, because for min to reach n there needs to be a continuous range of n trues in the array, and there can be only k trues in the array in total. Therefore, line B can be executed at most 2 * k times and line C at most k times.
Space complexity:
Instead of an array it is possible to use a HashMap without any additional changes in the pseudocode (non-existing keys in the HashMap should evaluate to false). Then the space complexity is O(k). Additionally, you can prune keys smaller than min, thus saving space in some cases:
HashMap<int,bool> map
int min = 1

void processNext(int val) {
    if (val < min)
        return
    map.put(val, true)
    while (map.get(min) == true) {
        map.remove(min)
        min++
    }
}

int getSmallestNotSeen() {
    return min
}
This pruning technique might be most effective if the stream values increase steadily.
Your solution takes O(N) space to hold the array and the sum segment tree, and O(N) time to initialise them; then O(log N) and O(log² N) for the two queries. It seems pretty clear that you can't do better than O(N) space in the long run to keep track of which numbers are "seen" so far, if there are going to be a lot of queries.
However, a different data structure can improve on the query times. Here are three ideas:
Self-balancing binary search tree
Initialise the tree to contain every number from 1 to N; this can be done in O(N) time by building the tree from the leaves up; the leaves have all the odd numbers, then they're joined by all the numbers which are 2 mod 4, then those are joined by the numbers which are 4 mod 8, and so on. The tree takes O(N) space.
processNext is implemented by removing the number from the tree in O(log N) time.
getSmallestNotSeen is implemented by finding the left-most node in O(log N) time.
This is an improvement if getSmallestNotSeen is called many times, but if getSmallestNotSeen is rarely called then your solution is better because it does processNext in O(1) rather than O(log N).
Doubly-linked list
Initialise a doubly-linked list containing the numbers 1 to N in order, and create an array of size N holding pointers to each node. This takes O(N) space and is done in O(N) time. Initialise a variable holding a cached minimum value to be 1.
processNext is implemented by looking up the corresponding list node in the array, and deleting it from the list. If the deleted node has no predecessor, update the cached minimum value to be the value held by the successor node. This is O(1) time.
getSmallestNotSeen is implemented by returning the cached minimum, in O(1) time.
This is also an improvement, and is strictly better asymptotically, although the constants involved might be higher; there's a lot of overhead to hold an array of size N and also a doubly-linked list of size N.
Hash-set
The time requirements for the other solutions are largely determined by their initialisation stages, which take O(N) time. Initialising an empty hash-set, on the other hand, is O(1). As before, we also initialise a variable holding a current minimum value to be 1.
processNext is implemented by inserting the number into the set, in O(1) amortised time.
getSmallestNotSeen updates the current minimum by incrementing it until it's no longer in the set, and then returns it. Membership tests on a hash-set are O(1), and the number of increments over all queries is limited by the number of times processNext is called, so this is also O(1) amortised time.
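A minimal Python sketch of this hash-set approach (variable names are mine):

seen = set()
smallest = 1

def process_next(val):
    seen.add(val)                # O(1) amortised

def get_smallest_not_seen():
    global smallest
    while smallest in seen:      # total increments bounded by len(seen)
        smallest += 1
    return smallest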
Asymptotically, this solution takes O(1) time for initialisation and queries, and it uses O(min(Q,N)) space where Q is the number of queries, while the other solutions use O(N) space regardless.
I think it should be straightforward to prove that O(min(Q,N)) space is asymptotically optimal, so the hash-set turns out to be the best option. Credit goes to Dave for combining the hash-set with a current-minimum variable to do getSmallestNotSeen in O(1) amortised time.

How to build a binary tree in O(N) time?

Following on from a previous question here, I'm keen to know how to build a binary tree from an array of N unsorted large integers in O(N) time.
Unless you have some preconditions on the list that allow you to calculate the position in the tree for each item in constant time, it is not possible to 'build' (that is, sequentially insert) items into a tree in O(N) time. Each insertion has to compare up to log M times, where M is the number of items already in the tree.
OK, just for completeness... The binary tree in question is built from an array and has a leaf for every array element. It keeps them in their original index order, not value order, so it doesn't magically let you sort a list in linear time. It also needs to be balanced.
To build such a tree in linear time, you can use a simple recursive algorithm like this (using 0-based indexes):
// build a tree of the elements [start, end) in array
// precondition: end > start
Node buildTree(int[] array, int start, int end)
{
    if (end - start > 1)
    {
        int mid = (start + end) >> 1;
        Node left = buildTree(array, start, mid);
        Node right = buildTree(array, mid, end);
        return new InternalNode(left, right);
    }
    else
    {
        return new LeafNode(array[start]);
    }
}
I agree that this seems impossible in general (assuming we have a general, totally ordered set S of N items). Below is an informal argument where I essentially reduce building a BST on S to the problem of sorting S.
Informal argument. Let S be a set of N elements, and suppose you could construct a binary search tree T storing the items of S in O(N) time.
Now do an inorder walk of the tree and print the values as you visit them. You have essentially sorted the elements of S. This took you O(|T|) steps, where |T| is the size of the tree (i.e. the number of nodes), which is O(N).
So in total you would have sorted S by comparisons in o(N log N) time, which contradicts the Ω(N log N) lower bound for comparison sorting.
I have an idea of how it could be possible. Sort the array with radix sort, which is O(N) for fixed-width integers. Thereafter, use a recursive procedure to build the tree from the midpoints, like:
struct node { int data; node *left, *right; };

node *insert(int *array, int size) {
    if(size <= 0)
        return NULL;
    node *rc = new node;
    int midpoint = size / 2;
    rc->data = array[midpoint];
    rc->left = insert(array, midpoint);
    rc->right = insert(array + midpoint + 1, size - midpoint - 1);
    return rc;
}
Since we never walk down from the root to find an insertion point, but construct each node directly, this is O(1) per node, i.e. O(N) overall.

Find First Unique Element

I had this question in an interview and couldn't answer it.
You have to find the first unique element (integer) in the array.
For example:
3,2,1,4,4,5,6,6,7,3,2,3
Then the unique elements are 1, 5 and 7, and the first unique element is 1.
The solution required:
O(n) Time Complexity.
O(1) Space Complexity.
I tried suggesting hashmaps and bit vectors... but none of them has O(1) space complexity.
Can anyone tell me solution with space O(1)?
Here's a non-rigorous proof that it isn't possible:
It is well known that duplicate detection cannot be done in better than O(n log n) time when you are restricted to O(1) space. Suppose the current problem were solvable in O(n) time and O(1) memory. If we get the index k of the first non-repeating number as anything other than 0, we know that the element at index k-1 is repeated, and hence with one more sweep through the array we can find its duplicate, making duplicate detection an O(n) exercise.
Again, this is not rigorous, and we could get into a worst-case analysis where k is always 0. But it helps you think, and convince the interviewer, that it isn't likely to be possible.
http://en.wikipedia.org/wiki/Element_distinctness_problem says:
Elements that occur more than n/k times in a multiset of size n may be found in time O(n log k). Here k = n, since we want elements that appear more than once, which gives O(n log n).
I think that this is impossible. This isn't a proof, but evidence for a conjecture. My reasoning is as follows...
First, you said that there is no bound on the value of the elements (they can be negative, zero, or positive). Second, there is only O(1) space, so we can't store more than a fixed number of values. Hence, we would have to solve this using only comparisons. Moreover, we can't sort or otherwise swap values in the array, because we would lose the original ordering of the unique values (and we can't store the original ordering).
Consider an array where all the integers are unique:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
In order to return the correct output 1 on this array, without reordering the array, we would need to compare each element to all the other elements, to ensure that it is unique, and do this in reverse order, so we can check the first unique element last. This would require O(n^2) comparisons with O(1) space.
I'll delete this answer if anyone finds a solution, and I welcome any pointers on making this into a more rigorous proof.
Note: This can't work in the general case. See the reasoning below.
Original idea
Perhaps there is a solution in O(n) time and O(1) extra space.
It is possible to build a heap in O(n) time. See Building a Heap.
So you build the heap backwards, starting at the last element in the array and making that last position the root. While building the heap, keep track of the most recent item that was not a duplicate.
This assumes that when inserting an item into the heap, you will encounter any identical item that already exists in the heap. I don't know if I can prove that . . .
Assuming the above is true, then when you're done building the heap, you know which item was the first non-duplicated item.
Why it won't work
The algorithm to build a heap in place starts at the midpoint of the array and assumes that all of the nodes beyond that point are leaf nodes. It then works backward (towards item 0), sifting items into the heap. The algorithm doesn't examine the last n/2 items in any particular order, and the order changes as items are sifted into the heap.
As a result, the best we could do (and even then I'm not sure we could do it reliably) is find the first non-duplicated item only if it occurs in the first half of the array.
The OP's original question doesn't mention any limit on the numbers (although it was later added that they can be negative/positive/zero). Here I assume one more condition:
The numbers in the array are all non-negative and smaller than the array length.
Then an O(n)-time, O(1)-space solution is possible, which seems fitting for an interview question, and the test case the OP gives complies with the above assumption.
Solution:
static int firstUnique(int[] nums) {
    // Bucket phase: afterwards, nums[v] == v iff v occurs exactly once.
    // -1 marks an emptied slot; -2 marks "this value occurred more than once"
    // (a second marker so that a later occurrence cannot restore the bucket,
    // which was a bug in the original version when a value appeared 3+ times).
    for (int i = 0; i < nums.length; i++) {
        while (nums[i] >= 0 && nums[i] != i) {
            int v = nums[i];
            if (nums[v] == v || nums[v] == -2) {
                nums[v] = -2;        // v is a duplicate: spoil its bucket for good
                nums[i] = -1;        // discard this copy
            } else {
                nums[i] = nums[v];   // move v into its bucket, keep whatever was there
                nums[v] = v;
            }
        }
    }
    // Scan phase: the first index that still holds its own value.
    for (int i = 0; i < nums.length; i++) {
        if (nums[i] == i) {
            return i;
        }
    }
    return -1;   // no unique element
}
The algorithm here treats the original array as the buckets of a bucket sort: each number is placed into its own bucket (nums[v] == v), and a value seen more than once has its bucket marked. Another loop then finds the first index with nums[i] == i. Note that since the array gets reordered, this returns the smallest unique value, which matches the expected output for the example above.

How to find sum of elements from given index interval (i, j) in constant time?

Given an array, how can we find the sum of the elements in an index interval (i, j) in constant time? You are allowed to use extra space.
Example:
A: 3 2 4 7 1 -2 8 0 -4 2 1 5 6 -1
length = 14
int getsum(int* arr, int i, int j, int len);
// suppose int array "arr" is initialized here
int sum = getsum(arr, 2, 5, 14);
sum should be 10 in constant time.
If you can spend O(n) time to "prepare" auxiliary information, based on which you would be able to calculate sums in O(1), you could easily do it.
Preparation (O(n)):
aux[0] = 0;
foreach i in (1..LENGTH) {
aux[i] = aux[i-1] + arr[i];
}
Query (O(1)); arr is numbered from 1 to LENGTH:
sum(i,j) = aux[j] - aux[i-1];
I think that was the intent, because otherwise it's impossible: to calculate sum(0, length-1) for an arbitrary length you would have to scan the whole array, which takes at least linear time.
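For the example from the question, here is the same idea in runnable Python (0-based, with a leading 0 in the auxiliary array so no special case is needed at i = 0):

arr = [3, 2, 4, 7, 1, -2, 8, 0, -4, 2, 1, 5, 6, -1]

aux = [0] * (len(arr) + 1)
for i, x in enumerate(arr):
    aux[i + 1] = aux[i] + x      # aux[k] = sum of arr[0..k-1]

def getsum(i, j):                # sum over the inclusive interval [i, j]
    return aux[j + 1] - aux[i]

print(getsum(2, 5))              # 4 + 7 + 1 - 2 = 10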
It cannot be done in constant time unless you store the information.
You would have to specially modify the array to store, for each index, the sum of all values between the start of the array and that index, and then use subtraction on the two range ends to get the difference in sums.
However, nothing in your code sample seems to allow this. The array is created by the user (and can change at any time) and you have no control over it.
Any algorithm that needs to scan a group of elements in a sequential unsorted list will be O(n).
Previous answers are absolutely fine for the question asked. I am just adding a point, in case the question is changed a bit, like:
Find the sum of the interval, if the array gets changed dynamically.
If array elements get changed, then we have to recompute the prefix sums stored in the auxiliary array, as mentioned in @Pavel Shved's approach.
Recomputing is an O(n) operation, so we instead use a segment tree, which brings both updates and range-sum queries down to O(log n).
http://www.geeksforgeeks.org/segment-tree-set-1-sum-of-given-range/
There are three well-known algorithms for range-based queries on [l, r]:
1. Segment tree: O(log N) per query
2. Fenwick (binary indexed) tree: O(log N) per query
3. Mo's algorithm (square root decomposition)
The first two can deal with modifications of the list/array given to you. The third, Mo's algorithm, is an offline algorithm, meaning all the queries need to be given to you beforehand; modifications of the list/array are not allowed. For implementation, runtime, and further reading on this algorithm you can check out this Medium blog. It explains it with code, and very few people actually know about the method.
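For reference, here is a minimal Python sketch of a Fenwick (binary indexed) tree for the dynamic case: point updates and range-sum queries, both O(log n). The class and method names are mine:

class Fenwick:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)    # 1-based internally

    def update(self, i, delta):      # add delta to element i (0-based): O(log n)
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix(self, i):             # sum of elements [0, i]: O(log n)
        i += 1
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def range_sum(self, i, j):       # sum of elements [i, j]
        return self.prefix(j) - (self.prefix(i - 1) if i > 0 else 0)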
This question can be solved in O(n^2) time with O(1) space, or in O(n) time with O(n) space; the better option here is the latter.
Suppose a[] = {1, 3, 5, 2, 6, 4, 9} is given.
We create an array sum[] in which sum[i] keeps the sum of the elements from index 0 up to index i, i.e. {1, 3+1, 3+1+5, ...}; for the array a[] above, sum[] = {1, 4, 9, 11, 17, 21, 30}. Building it takes O(n) time and O(n) space.
Given an index pair, the answer is then fetched directly from the sum array: add(i, j) = sum[j] - sum[i-1], which takes O(1) time and O(1) extra space.
int sum[] = new int[l];
sum[0] = a[0];
System.out.print(sum[0] + " ");
for (int i = 1; i < l; i++)
{
    sum[i] = sum[i-1] + a[i];
    System.out.print(sum[i] + " ");
}
/* this prints 1, 4, 9, 11, 17, 21, 30 and takes O(n) time and O(n) space */
sum(i, j) = sum[j] - sum[i-1] gives the sum of the indexes from i to j in O(1) time and O(1) extra space (taking sum[-1] as 0 when i == 0).
So this program takes O(n) time and O(n) space overall.

What would be the time complexity of counting the number of all structurally different binary trees?

Using the method presented here: http://cslibrary.stanford.edu/110/BinaryTrees.html#java
12. countTrees() Solution (Java)
/**
 For the key values 1...numKeys, how many structurally unique
 binary search trees are possible that store those keys?

 Strategy: consider that each value could be the root.
 Recursively find the size of the left and right subtrees.
*/
public static int countTrees(int numKeys) {
    if (numKeys <= 1) {
        return(1);
    }
    else {
        // there will be one value at the root, with whatever remains
        // on the left and right each forming their own subtrees.
        // Iterate through all the values that could be the root...
        int sum = 0;
        int left, right, root;

        for (root = 1; root <= numKeys; root++) {
            left = countTrees(root - 1);
            right = countTrees(numKeys - root);

            // number of possible trees with this root == left*right
            sum += left * right;
        }

        return(sum);
    }
}
I have a sense that it might be n(n-1)(n-2)...1, i.e. n!
If using a memoizer, is the complexity O(n)?
The number of structurally different binary trees with n nodes is the nth Catalan number. Catalan numbers are calculated as
C_n = (1 / (n+1)) * C(2n, n) = (2n)! / ((n+1)! * n!)
which can be computed in O(n) arithmetic operations.
http://mathworld.wolfram.com/BinaryTree.html
http://en.wikipedia.org/wiki/Catalan_number#Applications_in_combinatorics
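A quick Python check of the closed form (math.comb requires Python 3.8+); it matches what countTrees returns for small n:

from math import comb

def catalan(n):
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(6)])   # [1, 1, 2, 5, 14, 42]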
It's easy enough to count the number of calls to countTrees this algorithm makes for a given node count. After a few trial runs, it looks to me like it requires 5*3^(n-2) calls for n >= 2, which grows much more slowly than n!. The proof of this assertion is left as an exercise for the reader. :-)
A memoized version requires O(n) calls, as you suggested.
Incidentally, the number of binary trees with n nodes equals the n-th Catalan number.
The obvious approaches to calculating Cn all seem to be linear in n, so a memoized implementation of countTrees is probably the best one can do.
I'm not sure how many hits to the look-up table the memoized version makes (it is definitely super-linear and has the overhead of function calls), but since the mathematical proof yields the result to be the same as the nth Catalan number, one can quickly cook up a linear-time tabular method:
int C = 1;
for (int i = 1; i <= n; i++)
{
    // C_i = 2 * (2i - 1) / (i + 1) * C_{i-1}
    C = 2 * (2 * (i - 1) + 1) * C / ((i - 1) + 2);
}
return C;
Note the difference between Memoization and Tabulation here
