How to noisily select k smallest elements of an array?

So I wrote a function to find the k nodes of a graph that have the smallest degree. It looks like this:
def smallestKNodes(G, k):
    leastK = []
    for i in range(G.GetMxNId()):
        # Produces an iterator to the node
        node = G.GetNI(i)
        for j in range(k):
            if j >= len(leastK):
                leastK.append(node)
                break
            elif node.GetDeg() < leastK[j].GetDeg():
                leastK.insert(j, node)
                leastK = leastK[0:k]
                break
    return leastK[0:k]
My problem is that when all the nodes have the same degree, it selects the same nodes every time. How can I make it take all the nodes with zero degree (or whatever the minimum degree is) and then select the rest of the k nodes randomly?
Stipulations:
(1) Suppose k = 7, then if there are 3 nodes with degree 0 and 10 nodes with degree 1, I would like to choose all the nodes with degree 0, but randomly choose 4 of the nodes with degree 1.
(2) If possible I don't want to visit any node twice because there might be too many nodes to fit into memory. There might also be a very large number of nodes with minimum degree. In some cases there might also be a very small number of nodes.

Store all the nodes which satisfy your condition and randomly pick k nodes from that set. You can do the random pick by shuffling the array (e.g. Fisher-Yates, std::shuffle, randperm, etc.) and picking the first k nodes, for example.
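A minimal sketch of that idea in Python (collect candidates, shuffle, slice; random.sample(candidates, k) does the same thing in one call):

import random

def pick_k_random(candidates, k):
    # shuffle the collected candidates in place and take the first k
    random.shuffle(candidates)
    return candidates[:k]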

You might want to do two passes, the first pass to discover the relevant degree you have to randomize, how many nodes of that degree to choose, and the total number of nodes with that degree. Then, do a second pass on your nodes, choosing only those with the desired degree at random.
To choose k nodes of n total so each node has a fair probability (k/n), loop over relevant nodes, and choose each one with probability 1, 1, ..., 1, k/(k+1), k/(k+2), ..., k/n. When choosing a node, if k nodes are already chosen, throw one of them away at random.
import random

def randomNodesWithSpecificDegree(G, d, k, n):
    result = []
    examined = 0
    for i in range(G.GetMxNId()):
        # Produces an iterator to the node
        node = G.GetNI(i)
        if node.GetDeg() == d:
            examined = examined + 1
            if len(result) < k:
                result.append(node)
            elif random.random() < k / examined:
                # k nodes are already chosen: throw one away at random
                index = random.randrange(k)
                result[index] = node
    assert examined == n
    return result
This code works well when k is small and n is big (which seems to be your case).
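If it helps, the two passes might be glued together like this sketch. It reuses randomNodesWithSpecificDegree above; the Counter-based histogram, the function name, and the assumption that k does not exceed the number of nodes are mine:

from collections import Counter

def smallestKNodesRandomized(G, k):
    # Pass 1: histogram of node degrees
    hist = Counter(G.GetNI(i).GetDeg() for i in range(G.GetMxNId()))
    taken = 0
    for d in sorted(hist):           # find the degree that must be sampled
        if taken + hist[d] >= k:
            break
        taken += hist[d]
    # Pass 2: keep every node strictly below degree d, sample the rest
    result = [G.GetNI(i) for i in range(G.GetMxNId())
              if G.GetNI(i).GetDeg() < d]
    result += randomNodesWithSpecificDegree(G, d, k - taken, hist[d])
    return result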

Related

How many different rooted unlabelled binary trees have exactly 9 nodes and are left-heavy?

I know how many trees are possible using the nth Catalan number, but I don't know how to count the left-heavy ones. Is there any technique?
The answer is: 2357
I provide here a reasoned approach (no programming involved) and code to produce the same result, but via a more brute-force method.
A. Reasoning
Intuitively, it seems easier to count by exclusion. So that leads to this approach:
Count all the binary trees with 9 nodes. As you already indicated, this corresponds to the 9th Catalan number: C_9 = 4862.
Subtract the number of trees whose roots are balanced, i.e. where the two subtrees of the root have equal heights (let's call those subtrees L and R). That gives us the number of trees that are either left- or right-heavy.
As there are just as many left- as right-heavy trees, divide this result by two to get the final result.
So now we can focus on calculating the number mentioned in the second bullet point:
Counting trees that are strictly balanced at the root
A tree of height 2 would have at most 7 nodes (when it is full), so the height needs to be at least 3. A tree of height 5 that is balanced at the root needs at least 5 nodes in L (on a single path) and 5 in R (also on a single path), which together with the root is more than 9 nodes, so the height cannot be more than 4. We thus have only two possibilities: the height of a 9-node binary tree that is balanced at the root is either 3 or 4.
Let's deal with these two cases separately:
1. When the height of the tree is 3
In this case we have a tree with 4 levels. Let's analyse each level:
There are 3 nodes in the first two levels: the root and its two children.
The third level is the first level where there can be some variation: that level has between 2 and 4 nodes. Let's deal with those three cases one by one:
1.a Third level has 2 nodes
Here the third level has one node in L and one in R. Each can either be a left or right child of its parent. So there are two possibilities at either side: 2x2 = 4 possibilities.
There is no variation possible in the fourth level: the four remaining nodes are children of the two nodes in the third level.
Possibilities: 4
1.b Third level has 3 nodes
There are 4 ways to select three positions from the four available positions in the third level. Either L or R gets only one node. Let's call this node x.
In the fourth level we need to distribute the three remaining nodes, such that L and R get at least one of those. This is achieved when x gets either one or two children.
When x gets two children, the other remaining node can go into any of the 4 positions below the two third-level nodes on the other side: 4 possibilities.
When x gets one child, it can be either a left or right child, and the other two remaining nodes can occupy those four available positions in 6 ways: 2x6 = 12.
So given a choice in the third level, there are 4+12=16 possible configurations for the fourth level.
Combining this with the possibilities in the third level, we get 4x16:
Possibilities: 64
1.c Third level has 4 nodes
The third level is thus full. The two remaining nodes on the fourth level need to be split between L and R, and so each has 4 possible positions. This gives 4x4 = 16 possibilities in total.
Possibilities: 16
2. When the height of the tree is 4
When the height is 4, then by consequence L and R each have only one leaf: they are chains of 4 nodes each. This is the only way to make the root strictly balanced and get a height of 4.
There is no choice for the root node of L (it is the left child of the root), but from there on, each next descendant in L can be either a left or right child of its parent. The shape of L thus has 2^3 = 8 possibilities. Considering the same for R, we have a total of 8x8 = 64 shapes.
Possibilities: 64
Total
Taking all of the above together, we have 4 + 64 + 16 + 64 = 148 possible shapes that give a tree with a balanced root.
So applying the approach set out at the top, the total number of left-heavy binary trees with 9 unlabelled nodes is (4862-148)/2 = 2357
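As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python using the closed form for Catalan numbers:

from math import comb

def catalan(n):
    return comb(2 * n, n) // (n + 1)

balanced = 4 + 64 + 16 + 64           # the four cases counted above
print((catalan(9) - balanced) // 2)   # prints 2357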
B. Code
To make this a programming challenge, here is an implementation in JavaScript that defines the following functions:
countTreesUpToHeight(n, height): count all binary trees with n nodes, that are not higher than the given height. Uses recursion.
countTreesWithHeight(n, height): count all binary trees with n nodes, that have exactly the given height. Uses the preceding function.
countLeftHeavy(n): the main function. Uses the other two functions to count all combinations where the root's left subtree is higher than the right one.
So this approach is not like the exclusion approach above. It actually counts the combinations of interest. The output is the same.
function countTreesUpToHeight(n, height) {
    if (n > 2**(height+1) - 1) return 0; // too many nodes to fit within height
    if (n < 2) return 1;
    let count = 0;
    for (let i = 0; i < n; i++) {
        count += countTreesUpToHeight(i, height-1)
               * countTreesUpToHeight(n-1-i, height-1);
    }
    return count;
}

function countTreesWithHeight(n, height) {
    return countTreesUpToHeight(n, height) - countTreesUpToHeight(n, height-1);
}

function countLeftHeavy(n) {
    let count = 0;
    // make choices for the height of the left subtree
    for (let height = 0; height < n; height++) {
        // make choices for the number of nodes in the left subtree
        for (let i = 0; i < n; i++) {
            // multiply the number of combinations for the left subtree
            // with those for the right subtree
            count += countTreesWithHeight(i, height-1)
                   * countTreesUpToHeight(n-1-i, height-2);
        }
    }
    return count;
}

let result = countLeftHeavy(9);
console.log(result); // 2357

Fibonacci sums on a tree

Given a tree with n nodes (n can be as large as 2 * 10^5), where each node has a cost associated with it, let us define the following functions:
g(u, v) = the sum of all costs on the simple path from u to v
f(n) = the (n + 1)th Fibonacci number (n + 1 is not a typo)
The problem I'm working on requires me to compute the sum of f(g(u, v)) over all possible pairs of nodes in the tree modulo 10^9 + 7.
As an example, let's take a tree with 3 nodes.
without loss of generality, let's say node 1 is the root, and its children are 2 and 3
cost[1] = 2, cost[2] = 1, cost[3] = 1
g(1, 1) = 2; f(2) = 2
g(2, 2) = 1; f(1) = 1
g(3, 3) = 1; f(1) = 1
g(1, 2) = 3; f(3) = 3
g(2, 1) = 3; f(3) = 3
g(1, 3) = 3; f(3) = 3
g(3, 1) = 3; f(3) = 3
g(2, 3) = 4; f(4) = 5
g(3, 2) = 4; f(4) = 5
Summing all of the values, and taking the result modulo 10^9 + 7 gives 26 as the correct answer.
My attempt:
I implemented an algorithm to compute g(u, v) in O(log n) by finding the lowest common ancestor using a sparse table.
For the finding of the appropriate Fibonacci values, I tried two approaches, namely using exponentiation on the matrix form and another by noticing that the sequence modulo 10^9 + 7 is cyclical.
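For reference, the Fibonacci part can be sketched with fast doubling, which is the 2x2 matrix exponentiation collapsed into two recurrences (the function names here are mine):

MOD = 10**9 + 7

def fib_pair(n):
    # returns (F(n), F(n+1)) modulo MOD, with F(0) = 0, F(1) = 1
    if n == 0:
        return (0, 1)
    a, b = fib_pair(n >> 1)               # a = F(m), b = F(m+1), m = n // 2
    c = a * ((2 * b - a) % MOD) % MOD     # F(2m)   = F(m) * (2F(m+1) - F(m))
    d = (a * a + b * b) % MOD             # F(2m+1) = F(m)^2 + F(m+1)^2
    return (d, (c + d) % MOD) if n & 1 else (c, d)

def f(n):
    # f(n) = the (n + 1)th Fibonacci number, as defined above
    return fib_pair(n + 1)[0]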
Now comes the extremely tricky part. No matter how I do the above computations, I still end up visiting up to O(n^2) pairs when calculating the sum of all possible f(g(u, v)). I mean, there's the obvious improvement of only going up to n * (n - 1) / 2 pairs, but that's still quadratic.
What am I missing? I've been at it for several hours, but I can't see a way to get that sum without actually producing a quadratic algorithm.
To know how many times the cost of a node X is to be included in the total sum, we divide the other nodes into 3 (or more) groups:
the subtree A connected to the left of X
the subtree B connected to the right of X
(subtrees C, D... if the tree is not binary)
all other nodes Y, connected through X's parent
When two nodes belong to different groups, their simple path goes through X; X is also an endpoint of a path to each of the other N - 1 nodes. So the number of simple paths that contain X is:
(N - 1) + #A × #B + #A × #Y + #B × #Y
So by counting the total number of nodes N, and the size of the subtrees under X, you can calculate how many times the cost of node X should be included in the total sum. Do this for every node and you have the total cost.
The code for this could be straightforward. I'll assume that the total number of nodes N is known, and that you can add properties to the nodes (both of these assumptions simplify the algorithm, but it can be done without them).
We'll add a child_count to store the number of descendants of the node, and a path_count to store the number of simple paths that the node is part of; both are initialised to zero.
For each node, starting from the root:
If not all children have been visited, go to an unvisited child.
If all children have been visited (or the node is a leaf):
Increment child_count (it now equals the size of the subtree rooted at this node).
Increase path_count with (N - 1) + (child_count - 1) × (N - child_count): the paths that have this node as an endpoint, plus the paths that cross from inside its subtree to outside it.
Add this node's path_count × cost to the total cost.
If the current node is the root, we're done; otherwise:
Increase the parent node's path_count with this node's child_count × the parent's current child_count (paths that cross the parent between this subtree and the sibling subtrees already visited).
Increase the parent node's child_count with this node's child_count.
Go to the parent node.
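For concreteness, here is a sketch of that counting in Python. It computes each node's path count directly from subtree sizes rather than incrementally during the walk; the adjacency-list input, the choice of node 0 as root, and the function name are my own assumptions:

def node_path_counts(adj, n, root=0):
    # counts, for every node x, the unordered pairs {u, v} (u != v)
    # whose simple path contains x
    parent, size = [-1] * n, [1] * n
    order, seen, stack = [], [False] * n, [root]
    while stack:                        # iterative DFS, records visit order
        u = stack.pop()
        seen[u] = True
        order.append(u)
        for v in adj[u]:
            if not seen[v]:
                parent[v] = u
                stack.append(v)
    for u in reversed(order):           # children are processed before parents
        if parent[u] != -1:
            size[parent[u]] += size[u]
    counts = [0] * n
    for x in range(n):
        # group sizes around x: one per child subtree, plus the rest above x
        groups = [size[c] for c in adj[x] if parent[c] == x]
        if x != root:
            groups.append(n - size[x])
        crossing = (sum(groups) ** 2 - sum(g * g for g in groups)) // 2
        counts[x] = (n - 1) + crossing  # endpoint paths + paths crossing x
    return counts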
The below algorithm's running time is O(n^3).
A tree is a connected graph without cycles. So when we want to get all possible pairs' costs, we are in effect finding the (unique, hence shortest) paths for all pairs. Thus, we can use Dijkstra's idea and a dynamic programming approach for this problem (I took it from Weiss's book). Then we apply the Fibonacci function to each cost, assuming that we already have a table to look it up in.
Dijkstra's idea: we start from the root and search all simple paths from the root to all other nodes, and then do that for the other vertices on the graph.
Dynamic programming approach: we use a 2D matrix D[][] to represent the lowest path cost (the terms can be used interchangeably here) between node i and node j. Initially, D[i][i] is already set. If node i and node j are parent and child, D[i][j] = g(i, j), which is the cost between them. If node k is on a cheaper path between node i and node j, we update: D[i][j] = D[i][k] + D[k][j] if D[i][j] > D[i][k] + D[k][j], else D[i][j] stays.
When done, we check the D[][] matrix, apply the Fibonacci function to each cell, add them up, and also apply the modulo operation.
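A compact sketch of that approach (the edge-list input and function name are mine; note that the intermediate node's cost must not be counted twice when two subpaths are joined):

def sum_fib_over_pairs(n, edges, cost, f, MOD=10**9 + 7):
    # D[i][j] = cost of the simple path between i and j; O(n^3), small n only
    INF = float('inf')
    D = [[INF] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = cost[i]
    for u, v in edges:                          # parent/child pairs
        D[u][v] = D[v][u] = cost[u] + cost[v]
    for k in range(n):                          # Floyd-Warshall-style relaxation
        for i in range(n):
            for j in range(n):
                via = D[i][k] + D[k][j] - cost[k]   # count k's cost once
                if via < D[i][j]:
                    D[i][j] = via
    return sum(f(D[i][j]) for i in range(n) for j in range(n)) % MOD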

Find max subset of tree with max distance not greater than K

I ran into a dynamic programming problem on interviewstreet named "Far Vertices".
The problem is like:
You are given a tree that has N vertices and N-1 edges. Your task is to mark as small a number of vertices as possible so that the maximum distance between two unmarked vertices is less than or equal to K. You should write this value to the output. Distance between two vertices i and j is defined as the minimum number of edges you have to pass in order to reach vertex i from vertex j.
I was trying to do a DFS from every node of the tree, in order to find the maximum connected subset of nodes such that no pair in the subset has distance greater than K.
But I could not define the states and the transitions between them.
Is there anybody that could help me?
Thanks.
The problem consists essentially of finding the largest subtree of diameter <= k, and subtracting its size from n. You can solve it using DP.
Some useful observations:
The diameter of a tree rooted at node v (T(v)) is:
1 if v has no children,
max(diameter T(c), height T(c) + 1) if there is one child c,
max(max(diameter T(c)) for all children c of v, max(height T(c1) + height T(c2) + 2) for all children c1, c2 of v, c1 != c2)
Since we care about maximizing tree size and bounding tree diameter, we can flip the above around to suggest limits on each subtree:
For any tree rooted at v, the subtree of interest is at most k deep.
If n is a node in T(v) and has no children <= k away from v, its maximum size is 1.
If n has one child c, the maximum size of T(n) of diameter <= k is max size T(c) + 1.
Now for the tricky bit. If n has more than one child, we have to find all the possible tree sizes resulting from allocating the available depth to each child. So say we are at depth 3, k = 7, we have 4 depth left to play with. If we have three children, we could allocate all 4 to child 1, 3 to child 1 and 1 to child 2, 2 to child 1 and 1 to children 2 and 3, etc. We have to do this carefully, making sure we don't exceed diameter k. You can do this with a local DP.
What we want for each node is to calculate maxSize(d), which gives the max size of the tree rooted at that node that is up to d deep and has diameter <= k. Nodes with 0 and 1 children are easy to figure this for, as above (for example, for one child, v.maxSize(i) = c.maxSize(i - 1) + 1, and v.maxSize(0) = 1).
For nodes with 2 or more children, you compute dp[i][j], which gives the max size of a k-diameter-bound tree using up to the ith child and taking up to j depth. The recursion is dp[i][j] = max(child(i).maxSize(m - 1) + dp[i - 1][min(j, k - m)]) for m from 1 to j, with dp[i][0] = 1. This says: try giving the ith child 1 to j depth, and give the rest of the available depth to the previous children. The "rest of the available depth" is the minimum of j (the depth we are working with) or k - m, because the depth given to child i plus the depth given to the rest cannot exceed k. Transfer the values of the last row of dp to the maxSize table for this node.
If you run the above using a depth-limited DFS, it will compute all the necessary maxSize entries in the correct order, and the answer for node v is v.maxSize(k). Then you do this once for every node in the tree, and the answer is the maximum value found.
Sorry for the muddled nature of the explanation. It was hard for me to think through, and difficult to describe. Working through a few simple examples should make it clearer. I haven't calculated the complexity, but n is small, and it went through all the test cases in 0.5 to 1 s in Scala.
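If you want to sanity-check a DP implementation on tiny trees, an exponential brute force is easy to write. This sketch (0-indexed adjacency list; my own helper, not from the contest) returns the size of the largest subset with pairwise distance <= k, so the number of marks is n minus that:

from collections import deque
from itertools import combinations

def max_subset_within_k(adj, n, k):
    # all-pairs distances by BFS from every vertex (fine for tiny n)
    dist = [[-1] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[s][v] < 0:
                    dist[s][v] = dist[s][u] + 1
                    q.append(v)
    # try subsets from largest to smallest; the first hit is the answer
    for size in range(n, 0, -1):
        for sub in combinations(range(n), size):
            if all(dist[u][v] <= k for u, v in combinations(sub, 2)):
                return size
    return 0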
A few basic things I can notice (maybe very obvious to others):
1. There is only one route possible between two given vertices.
2. The farthest vertices would be the ones with only one edge (the leaves).
Now to solve the issue.
I would start with the set of vertices that have only one edge; call them EDGE[]. Calculate the distances between the vertices in EDGE[]. This will give you (EDGE[i], EDGE[j], distance) value triples.
For all the vertex pairs in EDGE that have a distance > K, do: EDGE[i].occur++, EDGE[i].distance = MAX(EDGE[i].distance, distance);
EDGE[j].occur++, EDGE[j].distance = MAX(EDGE[j].distance, distance).
Find the CANDIDATES in EDGE[] that have max(distance); among those, mark the ones with max(occur).
Repeat till all edge-vertex pairs have distance less than or equal to K.

Efficient algorithm for random sampling from a distribution while allowing updates?

This is a question I was asked some time ago in an interview and could not find an answer for.
Given some samples S1, S2, ..., Sn and their probability distribution (or weights, whatever it is called) P1, P2, ..., Pn, design an algorithm that randomly chooses a sample, taking into account its probability. The solution I came up with is as follows:
Build a cumulative array of weights Ci, such that
C0 = 0;
Ci = C[i-1] + Pi.
At the same time calculate T = P1 + P2 + ... + Pn.
This takes O(n) time.
Generate a uniformly random number R = T * random[0..1].
Using binary search, return the least i such that Ci >= R.
The result is Si. This takes O(log N) time.
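That baseline fits in a few lines of Python (bisect does the binary search; indices here are 0-based):

import random
from bisect import bisect_left
from itertools import accumulate

def build(P):
    return list(accumulate(P))       # C[i] = P[0] + ... + P[i], O(n)

def sample(C):
    R = C[-1] * random.random()      # R = T * random[0..1)
    return bisect_left(C, R)         # least i such that C[i] >= R, O(log n)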
Now the actual question is:
Suppose I want to change one of the initial Weights Pj. how to do this in better than O(n) time?
other data structures are acceptable, but random sampling algorithm should not get worse than O(logN).
One way to solve this is to rethink how your binary search tree containing the cumulative totals is built. Rather than building a binary search tree, think about having each node interpreted as follows:
Each node stores a range of values that are dedicated to the node itself.
Nodes in the left subtree represent sampling from the probability distribution just to the left of that range.
Nodes in the right subtree represent sampling from the probability distribution just to the right of that range.
For example, suppose our weights are 3, 2, 2, 2, 2, 1, and 1 for events A, B, C, D, E, F, and G. We build this binary tree holding A, B, C, D, E, F, and G:
     D
   /   \
  B     F
 / \   / \
A   C E   G
Now, we annotate the tree with probabilities. Since A, C, E, and G are all leaves, we give each of them probability mass one:
     D
   /   \
  B     F
 / \   / \
A   C E   G
1   1 1   1
Now, look at the tree for B. B has weight 2 of being chosen, A has weight 3 of being chosen, and C has weight 2 of being chosen. If we normalize these to the range [0, 1), then A accounts for 3/7 of the probability and B and C each account for 2/7. Thus we have the node for B say that anything in the range [0, 3/7) goes to the left subtree, anything in the range [3/7, 5/7) maps to B, and anything in the range [5/7, 1) maps to the right subtree:
                 D
               /   \
          B              F
[0, 3/7) / \ [5/7, 1)   / \
        A   C          E   G
        1   1          1   1
Similarly, let's process F. E has weight 2 of being chosen, while F and G each have weight 1 of being chosen. Thus the subtree for E accounts for 1/2 of the probability mass here, the node F accounts for 1/4, and the subtree for G accounts for 1/4. This means we can assign probabilities as
                     D
                   /   \
          B                   F
[0, 3/7) / \ [5/7, 1)  [0, 1/2) / \ [3/4, 1)
        A   C                  E   G
        1   1                  1   1
Finally, let's look at the root. The combined weight of the left subtree is 3 + 2 + 2 = 7. The combined weight of the right subtree is 2 + 1 + 1 = 4. The weight of D itself is 2. Thus the left subtree has probability 7/13 of being picked, D has probability 2/13 of being picked, and the right subtree has probability 4/13 of being picked. We can thus finalize the probabilities as
                     D
         [0, 7/13) /   \ [9/13, 1)
          B                   F
[0, 3/7) / \ [5/7, 1)  [0, 1/2) / \ [3/4, 1)
        A   C                  E   G
        1   1                  1   1
To generate a random value, you would repeat the following:
Starting at the root:
Choose a uniformly-random value in the range [0, 1).
If it's in the range for the left subtree, descend into it.
If it's in the range for the right subtree, descend into it.
Otherwise, return the value corresponding to the current node.
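A minimal sketch of that walk, assuming each node stores its item, its children, and the probability mass assigned to each subtree (the field names are made up here, not part of the answer):

import random

class Node:
    def __init__(self, item, left=None, right=None, p_left=0.0, p_right=0.0):
        self.item, self.left, self.right = item, left, right
        self.p_left, self.p_right = p_left, p_right

def sample(root):
    node = root
    while True:
        r = random.random()              # uniform in [0, 1)
        if r < node.p_left:              # falls in the left subtree's range
            node = node.left
        elif r >= 1.0 - node.p_right:    # falls in the right subtree's range
            node = node.right
        else:                            # falls in the node's own range
            return node.item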
The probabilities themselves can be determined recursively when the tree is built:
The left and right probabilities are 0 for any leaf node.
If an interior node itself has weight W, its left tree has total weight WL, and its right tree has total weight WR, then the left probability is (WL) / (W + WL + WR) and the right probability is (WR) / (W + WL + WR).
The reason that this reformulation is useful is that it gives us a way to update probabilities in O(log n) time per probability updated. In particular, let's think about what invariants are going to change if we update some particular node's weight. For simplicity, let's assume the node is a leaf for now. When we update the leaf node's weight, the probabilities are still correct for the leaf node, but they're incorrect for the node just above it, because the weight of one of that node's subtrees has changed. Thus we can (in O(1) time) recompute the probabilities for the parent node by just using the same formula as above.
But then the parent of that node no longer has the correct values because one of its subtree weights has changed, so we can recompute the probability there as well. This process repeats all the way back up to the root of the tree, with us doing O(1) computation per level to rectify the weights assigned to each edge. Assuming that the tree is balanced, we therefore have to do O(log n) total work to update one probability. The logic is identical if the node isn't a leaf node; we just start somewhere in the tree.
In short, this gives
O(n) time to construct the tree (using a bottom-up approach),
O(log n) time to generate a random value, and
O(log n) time to update any one value.
Hope this helps!
Instead of an array, store the search structure as a balanced binary tree. Every node of the tree should store the total weight of the elements it contains. Depending on the value of R, the search procedure either returns the current node or searches through the left or right subtree.
When the weight of an element is changed, the updating of the search structure is a matter of adjusting the weights on the path from the element to the root of the tree.
Since the tree is balanced, the search and the weight update operations are both O(log N).
For those of you who would like some code, here's a python implementation:
import numpy

class DynamicProbDistribution(object):
    """ Given a set of weighted items, randomly samples an item with probability
    proportional to its weight. This class also supports fast modification of the
    distribution, so that changing an item's weight requires O(log N) time.
    Sampling requires O(log N) time. """

    def __init__(self, weights):
        self.num_weights = len(weights)
        self.weights = numpy.empty((1 + len(weights),), 'float32')
        self.weights[0] = 0  # Not necessary but easier to read after printing
        self.weights[1:] = weights
        self.weight_tree = numpy.zeros((1 + len(weights),), 'float32')
        self.populate_weight_tree()

    def populate_weight_tree(self):
        """ The value of every node in the weight tree is equal to the sum of all
        weights in the subtree rooted at that node. """
        i = self.num_weights
        while i > 0:
            weight_sum = self.weights[i]
            twoi = 2 * i
            if twoi < self.num_weights:
                weight_sum += self.weight_tree[twoi] + self.weight_tree[twoi + 1]
            elif twoi == self.num_weights:
                weight_sum += self.weights[twoi]
            self.weight_tree[i] = weight_sum
            i -= 1

    def set_weight(self, item_idx, weight):
        """ Changes the weight of the given item. """
        i = item_idx + 1
        self.weights[i] = weight
        while i > 0:
            weight_sum = self.weights[i]
            twoi = 2 * i
            if twoi < self.num_weights:
                weight_sum += self.weight_tree[twoi] + self.weight_tree[twoi + 1]
            elif twoi == self.num_weights:
                weight_sum += self.weights[twoi]
            self.weight_tree[i] = weight_sum
            i //= 2  # Only need to modify the parents of this node

    def sample(self):
        """ Returns an item index sampled from the distribution. """
        i = 1
        while True:
            twoi = 2 * i
            if twoi < self.num_weights:
                # Two children
                val = numpy.random.random() * self.weight_tree[i]
                if val < self.weights[i]:
                    # all indices are offset by 1 for fast traversal of the
                    # internal binary tree
                    return i - 1
                elif val < self.weights[i] + self.weight_tree[twoi]:
                    i = twoi      # descend into the left subtree
                else:
                    i = twoi + 1  # descend into the right subtree
            elif twoi == self.num_weights:
                # One child
                val = numpy.random.random() * self.weight_tree[i]
                if val < self.weights[i]:
                    return i - 1
                else:
                    i = twoi
            else:
                # No children
                return i - 1
def validate_distribution_results(dpd, weights, samples_per_item=1000):
    import time
    bins = numpy.zeros((len(weights),), 'float32')
    num_samples = int(samples_per_item * numpy.sum(weights))
    start = time.time()
    for i in range(num_samples):
        bins[dpd.sample()] += 1
    duration = time.time() - start
    bins *= numpy.sum(weights)
    bins /= num_samples
    print("Time to make %s samples: %s" % (num_samples, duration))
    # These should be very close to each other
    print("\nWeights:\n", weights)
    print("\nBins:\n", bins)
    sdev_tolerance = 10  # very unlikely to be exceeded
    tolerance = float(sdev_tolerance) / numpy.sqrt(samples_per_item)
    print("\nTolerance:\n", tolerance)
    error = numpy.abs(weights - bins)
    print("\nError:\n", error)
    assert (error < tolerance).all()
## test
def test_DynamicProbDistribution():
    # First test that the initial distribution generates valid samples.
    weights = [2, 5, 4, 8, 3, 6, 6, 1, 3, 4, 7, 9]
    dpd = DynamicProbDistribution(weights)
    validate_distribution_results(dpd, weights)
    # Now test that we can change the weights and still sample from the
    # distribution.
    print("\nChanging weights...")
    dpd.set_weight(4, 10)
    weights[4] = 10
    dpd.set_weight(9, 2)
    weights[9] = 2
    dpd.set_weight(5, 4)
    weights[5] = 4
    dpd.set_weight(11, 3)
    weights[11] = 3
    validate_distribution_results(dpd, weights)
    print("\nTest passed")

if __name__ == '__main__':
    test_DynamicProbDistribution()
I've implemented a version related to Ken's code, but balanced with a red/black tree for worst-case O(log n) operations. This is available as weightedDict.py at: https://github.com/google/weighted-dict
(I would have added this as a comment to Ken's answer, but don't have the reputation to do that!)

binary tree data structures

Can anybody give me a proof of why the number of nodes in a strictly binary tree is 2n-1, where n is the number of leaf nodes?
Proof by induction.
The base case is a single leaf, where the tree has 1 = 2*1 - 1 node. Suppose the claim is true for k leaves; we prove it for k+1. In a strictly binary tree with k+1 leaves, pick an internal node whose two children are both leaves, and remove those two leaves; their parent becomes a leaf, so the resulting tree has k leaves and, by the induction hypothesis, 2k-1 nodes. Adding the two removed leaves back gives 2k-1+2 = 2k+1 = 2*(k+1)-1 nodes.
Just go with the basics: assuming there are x nodes in total, we have n nodes with degree 1 (the leaves), 1 node with degree 2 (the root), and x-n-1 nodes with degree 3 (the inner nodes).
A tree with x nodes has x-1 edges, so summing the degrees gives
n + 3*(x-n-1) + 2 = 2(x-1) (equating the total degree to twice the number of edges)
and solving for x we get x = 2n-1.
I'm guessing that what you really want is something like a proof that the depth is log2(N), where N is the number of nodes. In this case, the answer is fairly simple: for any given depth D, the number of nodes at that depth is 2^D.
Edit: in response to the edited question: the same fact pretty much applies. Since the number of nodes at depth D is 2^D, the number of nodes further up the tree is 2^(D-1) + 2^(D-2) + ... + 2^0 = 2^D - 1. Therefore, the total number of nodes in a balanced binary tree is 2^D + 2^D - 1 = 2^(D+1) - 1. If you set n = 2^D (the number of leaves), you've gone full circle back to the original equation.
I think you are trying to work out a proof for: N = 2L - 1 where L is the number
of leaf nodes and N is the total number of nodes in a binary tree.
For this formula to hold you need to put a few restrictions on how the binary
tree is constructed. Each node is either a leaf, which means it has no children, or
it is an internal node. Internal nodes have 3
possible configurations:
2 leaf nodes
1 leaf node and 1 internal node
2 internal nodes
All three configurations imply that an internal node connects to two other nodes. This explicitly rules out the situation where a node connects to a single child, as in:

  o
 /
o
Informal Proof
Start with a minimal tree of 1 leaf: L = 1, N = 1. Substitute into N = 2L - 1 and see that the formula holds true (1 = 1, so far so good).
Now add another minimal chunk to the tree. To do that you need to add another two nodes, and the tree looks like:

  o
 / \
o   o

Notice that you must add nodes in pairs to satisfy the restriction stated earlier. Adding a pair of nodes always adds one leaf (two new leaf nodes, but you lose one as it becomes an internal node). Node growth progresses as the series: 1, 3, 5, 7, 9... but leaf growth is: 1, 2, 3, 4, 5... That is why the formula N = 2L - 1 holds for this type of tree.
You might use mathematical induction to construct a formal proof, but this works fine for me.
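The growth argument is also easy to check mechanically; a tiny sketch:

def nodes_for_leaves(leaves):
    nodes, leaf_count = 1, 1     # minimal tree: a single leaf
    while leaf_count < leaves:
        nodes += 2               # attach a pair of children to some leaf
        leaf_count += 1          # two new leaves, but the parent stops being one
    return nodes

assert all(nodes_for_leaves(n) == 2 * n - 1 for n in range(1, 100))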
Proof by mathematical induction:
The statement that there are 2n-1 nodes in a strictly binary tree with n leaf nodes is true for n = 1 (a tree with only one node, i.e. the root node).
Let us assume that the statement is true for a tree with n-1 leaf nodes; such a tree has 2(n-1)-1 = 2n-3 nodes.
To form a tree with n leaf nodes, we add 2 child nodes to one of the leaf nodes in the above tree. Thus the total number of nodes = 2n-3+2 = 2n-1.
Hence, proved.
To prove: A strictly binary tree with n leaves contains 2n-1 nodes.
Show P(1): A strictly binary tree with 1 leaf contains 2(1)-1 = 1 node.
Show P(2): A strictly binary tree with 2 leaves contains 2(2)-1 = 3 nodes.
Show P(3): A strictly binary tree with 3 leaves contains 2(3)-1 = 5 nodes.
Assume P(K): A strictly binary tree with K leaves contains 2K-1 nodes.
Prove P(K+1): A strictly binary tree with K+1 leaves contains 2(K+1)-1 nodes.
2(K+1)-1 = 2K+2-1
         = 2K+1
         = (2K-1) + 2 *
* This result indicates that, for each leaf that is added, another node must be added to the parent of the leaf, in order for it to continue to be a strictly binary tree. So, for every additional leaf, a total of two nodes must be added, as expected.
#include <iostream>
#include <cmath>
using namespace std;

int main()
{
    int N = 1024;        // insert here the value of N (the number of leaves, a power of 2)
    int sum = 0;         // the number of total nodes
    int currFactor = 1;
    for (int i = 0; i <= log2(N); ++i) // there are log2(N)+1 levels
    {
        sum += currFactor;
        currFactor *= 2; // in each level the number of nodes is double the level above
    }
    if (sum == 2*N - 1)
    {
        cout << "wow that the number of nodes is 2*N-1";
    }
    return 0;
}
