More localized, efficient Lowest Common Ancestor algorithm given multiple binary trees?

I have multiple binary trees stored as an array. In each slot is either nil (or null; pick your language) or a fixed tuple storing two numbers: the indices of the two "children". No node will have only one child -- it's either none or two.
Think of each slot as a binary node that only stores pointers to its children, and no inherent value.
Take this system of binary trees:
      0             1
     / \           / \
    2   3         4   5
   / \               / \
  6   7             8   9
     / \
   10   11
The associated array would be:
    0       1       2      3     4      5      6      7        8     9    10    11
[ [2,3] , [4,5] , [6,7] , nil , nil , [8,9] , nil , [10,11] , nil , nil , nil , nil ]
I've already written simple functions to find the direct parent of a node (simply by searching from the front of the array until a node is found that contains the child).
Furthermore, at any relevant time, the trees can be anywhere from a few to a few thousand levels deep.
I'd like to find a function
P(m,n)
to find the lowest common ancestor of m and n -- to put it more formally, the LCA is defined as the "lowest", or deepest, node that has both m and n as descendants (children, or children of children, etc.). If there is none, nil would be a valid return.
Some examples, given the tree above:
P( 6,11) # => 2
P( 3,10) # => 0
P( 8, 6) # => nil
P( 2,11) # => 2
The main method I've been able to find is one that uses an Euler trace, which turns the given tree (adding a node A as the invisible parent of 0 and 1, with a "value" of -1) into:
A-0-2-6-2-7-10-7-11-7-2-0-3-0-A-1-4-1-5-8-5-9-5-1-A
And from that, simply find the node between your given m and n that has the lowest number. For example, to find P(6,11), look for a 6 and an 11 on the trace. The lowest number between them is 2, and that's your answer. If A (-1) is between them, return nil.
-- Calculating P(6,11) --
A-0-2-6-2-7-10-7-11-7-2-0-3-0-A-1-4-1-5-8-5-9-5-1-A
      ^ ^        ^
      | |        |
      m lowest   n
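For reference, building that trace from the array form is a straightforward walk. Here is a minimal Python sketch (my illustration, with the roots assumed known and -1 playing the role of A); taking the lowest number between m and n is valid here only because, as clarified below, every child has a larger number than its parent:
def euler_trace(tree, roots):
    # tree is the array of child tuples (or None); -1 stands for A
    trace = [-1]
    def walk(i):
        trace.append(i)
        if tree[i] is not None:
            for c in tree[i]:
                walk(c)
                trace.append(i)
    for r in roots:
        walk(r)
        trace.append(-1)
    return trace

def p(trace, m, n):
    i, j = sorted((trace.index(m), trace.index(n)))
    lowest = min(trace[i:j + 1])
    return None if lowest == -1 else lowest

tree = [(2, 3), (4, 5), (6, 7), None, None, (8, 9),
        None, (10, 11), None, None, None, None]
trace = euler_trace(tree, [0, 1])
print(p(trace, 6, 11))  # -> 2
print(p(trace, 8, 6))   # -> None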
Unfortunately, I do believe that finding the Euler trace of a tree that can be several thousand levels deep is a bit machine-taxing... and because my trees are constantly being changed as the program runs, I would have to re-calculate the Euler trace and hold it in memory every time I wanted to find an LCA.
Is there a more memory-efficient way, given the framework I'm using? One that maybe iterates upwards? One way I can think of would be to "count" the generation/depth of both nodes, climb the deeper node until it matches the depth of the shallower one, and then climb both together until they meet.
But that would involve climbing up from level, say, 3025, back to 0, twice, just to count the generations, using a terribly inefficient climbing-up algorithm in the first place, and then re-climbing back up.
Are there any other better ways?
Clarifications
In the way this system is built, every child will have a number greater than its parent.
This does not guarantee that if n is in generation X, there are no nodes in generation (X-1) that are greater than n. For example:
          0
         / \
        /   \
       /     \
      1   2   6
     / \ / \ / \
    2  3 9 10 7  8
   / \      / \
  4   5   11   12
is a valid tree system.
Also, an artifact of the way the trees are built is that the two immediate children of the same parent will always be consecutively numbered.

Are the nodes ordered like in your example, where children have a larger id than their parent? If so, you might be able to do something similar to a merge sort to find them. For your example, the ancestor chains of 6 and 11 are:
6 -> 2 -> 0
11 -> 7 -> 2 -> 0
So perhaps the algorithm would be:
left = left_start
right = right_start
while left is not nil and right is not nil
    if left = right
        return left
    else if left > right
        left = parent(left)
    else
        right = parent(right)
return nil
Which would run as:
left   right
----   -----
  6     11      (right -> 7)
  6      7      (right -> 2)
  6      2      (left -> 2)
  2      2      (return 2)
Is this correct?
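For concreteness, here is a minimal Python rendering of this walk against the array layout from the question (my sketch; parent is the linear scan the asker says he already has):
def parent(tree, child):
    # linear scan for the node whose tuple contains `child`; None for roots
    for i, kids in enumerate(tree):
        if kids is not None and child in kids:
            return i
    return None

def lca(tree, m, n):
    # repeatedly replace the larger index by its parent until the two meet
    while m is not None and n is not None:
        if m == n:
            return m
        if m > n:
            m = parent(tree, m)
        else:
            n = parent(tree, n)
    return None  # ran off a root: the nodes are in different trees

tree = [(2, 3), (4, 5), (6, 7), None, None, (8, 9),
        None, (10, 11), None, None, None, None]
print(lca(tree, 6, 11))  # -> 2
print(lca(tree, 3, 10))  # -> 0
print(lca(tree, 8, 6))   # -> None
print(lca(tree, 2, 11))  # -> 2
The walk is sound because of the clarified invariant that every child has a larger index than its parent, so the node with the larger index can never be an ancestor of the other.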

Maybe this will help: Dynamic LCA Queries on Trees, by Richard Cole and Ramesh Hariharan (Symposium on Discrete Algorithms, SODA 1999).
Abstract:
We show how to maintain a data structure on trees which allows for the following operations, all in worst-case constant time. 1. Insertion of leaves and internal nodes. 2. Deletion of leaves. 3. Deletion of internal nodes with only one child. 4. Determining the Least Common Ancestor of any two nodes.

I've solved your problem in Haskell. Assuming you know the roots of the forest, the solution takes time linear in the size of the forest and constant additional memory. You can find the full code at http://pastebin.com/ha4gqU0n.
The solution is recursive, and the main idea is that you can call a function on a subtree which returns one of four results:
The subtree contains neither m nor n.
The subtree contains m but not n.
The subtree contains n but not m.
The subtree contains both m and n, and the index of their least common ancestor is k.
A node without children may contain m, n, or neither, and you simply return the appropriate result.
If a node with index k has two children, you combine the results as follows:
join :: Int -> Result -> Result -> Result
join _ (HasBoth k) _ = HasBoth k
join _ _ (HasBoth k) = HasBoth k
join _ HasNeither r = r
join _ r HasNeither = r
join k HasLeft HasRight = HasBoth k
join k HasRight HasLeft = HasBoth k
After computing this result you have to check the index k of the node itself; if k is equal to m or n, you will "extend" the result of the join operation.
My code uses algebraic data types, but I've been careful to assume you need only the following operations:
Get the index of a node
Find out if a node is empty, and if not, find its two children
Since your question is language-agnostic I hope you'll be able to adapt my solution.
There are various performance tweaks you could put in. For example, if you find a root that has exactly one of the two nodes m and n, you can quit right away, because you know there's no common ancestor. Also, if you look at one subtree and it has the common ancestor, you can ignore the other subtree (that one I get for free using lazy evaluation).
Your question was primarily about how to save memory. If a linear-time solution is too slow, you'll probably need an auxiliary data structure. Space-for-time tradeoffs are the bane of our existence.
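In Python terms, the same recursion might look like the following (my adaptation, not the original code; the four results are encoded as 'neither', 'm', 'n', or a ('both', k) tuple):
def search(tree, k, m, n):
    # result for the subtree rooted at index k
    res = 'neither'
    if tree[k] is not None:
        left = search(tree, tree[k][0], m, n)
        if isinstance(left, tuple):   # LCA already found below; the other
            return left               # subtree can be ignored entirely
        right = search(tree, tree[k][1], m, n)
        if isinstance(right, tuple):
            return right
        if left != 'neither' and right != 'neither':
            return ('both', k)        # k is the join point
        res = left if left != 'neither' else right
    # "extend" the joined result with the node's own index
    if k == m:
        return ('both', k) if res == 'n' else 'm'
    if k == n:
        return ('both', k) if res == 'm' else 'n'
    return res

def lca(tree, roots, m, n):
    for r in roots:
        result = search(tree, r, m, n)
        if isinstance(result, tuple):
            return result[1]
        if result != 'neither':
            return None  # this root held exactly one of m, n: quit right away
    return None

tree = [(2, 3), (4, 5), (6, 7), None, None, (8, 9),
        None, (10, 11), None, None, None, None]
print(lca(tree, [0, 1], 6, 11))  # -> 2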

I think that you can simply loop backwards through the array, always replacing the higher of the two indices by its parent, until they are either equal or no further parent is found:
(defun lowest-common-ancestor (array node-index-1 node-index-2)
  (cond ((or (null node-index-1)
             (null node-index-2))
         nil)
        ((= node-index-1 node-index-2)
         node-index-1)
        ((< node-index-1 node-index-2)
         (lowest-common-ancestor array
                                 node-index-1
                                 (find-parent array node-index-2)))
        (t
         (lowest-common-ancestor array
                                 (find-parent array node-index-1)
                                 node-index-2))))

Related

Generate all the leaf-to-leaf paths in an n-ary tree

Given an n-ary tree, I have to generate all the leaf-to-leaf paths. The path should also denote the direction. As an example:
Tree:
      1
     / \
    2   6
   / \
  3   4
 /
5
Paths:
5 UP 3 UP 2 DOWN 4
4 UP 2 UP 1 DOWN 6
5 UP 3 UP 2 UP 1 DOWN 6
These paths can be in any order, but all paths need to be generated.
I kind of see the pattern: it looks like I have to do an in-order traversal and need to save what I have seen so far. However, I can't really come up with an actual working algorithm.
Can anyone nudge me toward the correct algorithm?
I am not looking for the actual implementation; just the pseudocode and the conceptual idea would be much appreciated.
The first thing I would do is perform an in-order traversal. As a result, we will accumulate all the leaves in order from the leftmost to the rightmost node (in your case this would be [5,4,6]).
Along the way, I would certainly find the mapping between nodes and their parents so that we can perform DFS later. We can keep this mapping in a HashMap (or its analogue). Apart from this, we will need the mapping between nodes and their priorities, which we can compute from the result of the in-order traversal. In your example the in-order traversal would be [5,3,2,4,1,6] and the list of priorities would be [0,1,2,3,4,5] respectively.
Here I assume that our node looks like this (we may not have the mapping node -> parent a priori):
class TreeNode {
    int val;
    TreeNode[] nodes;
    TreeNode(int x) {
        val = x;
    }
}
If we have n leaves, then we need to find n * (n - 1) / 2 paths. Obviously, if we have managed to find a path from leaf A to leaf B, then we can easily calculate the path from B to A (by transforming UP -> DOWN and vice versa).
Then we start traversing over the array of leaves we computed earlier. For each leaf in the array we should be looking for paths to leaves situated to the right of the current one (since we have already found the paths from the leftmost nodes to the current leaf).
To perform the DFS, we should go upwards and, for each encountered node, check whether we can go to its children. We should NOT go to a child whose priority is less than the priority of the current leaf (doing so would lead us to paths we already have). In addition, we should not visit nodes we have already visited along the way.
As we perform the DFS from some node, we can maintain a structure to keep the nodes we have come across so far (for instance, a StringBuilder if you program in Java). In our case, if we have reached leaf 4 from leaf 5, we have accumulated the path 5 UP 3 UP 2 DOWN 4. Since we have reached a leaf, we discard the last visited node and proceed with the DFS and the path 5 UP 3 UP 2.
There might be a more advanced technique for solving this problem, but I think it is a good starting point. I hope this approach will help you out.
I didn't manage to create a solution without programming it out in Python. Under the assumption that I didn't overlook a corner case, my attempt goes like this:
In a depth-first search, every node receives the down-paths, emits them (plus itself) if the node is a leaf, or passes the down-paths on to its children. The only thing to consider is that a leaf node is also the starting point of an up-path, so up-paths are fed into the sibling children to its right as well as returned to the parent node.
def print_leaf2leaf(root, path_down):
    for st in path_down:
        st.append(root)
    if all(x is None for x in root.children):
        # leaf: print every path that arrived here
        for st in path_down:
            for n in st:
                print(n.d, end=" ")
            print()
        path_up = [[root]]
    else:
        path_up = []
        for child in root.children:
            if child is not None:
                path_up += [st + [root]
                            for st in print_leaf2leaf(child, path_down + path_up)]
    for st in path_down:
        st.pop()
    return path_up
class node:
    def __init__(self, d, *children):
        self.d = d
        self.children = children

##      1
##     / \
##    2   6
##   / \   \
##  3   4   7
##  |     / | \
##  5    8  9  10

five = node(5)
three = node(3, five)
four = node(4)
two = node(2, three, four)
eight = node(8)
nine = node(9)
ten = node(10)
seven = node(7, eight, nine, ten)
six = node(6, None, seven)
one = node(1, two, six)

print_leaf2leaf(one, [])

How do you find the number of leaves at the lowest level of a complete binary tree?

I'm trying to define an algorithm that returns the number of leaves at the lowest level of a complete binary tree. By a complete binary tree, I mean a binary tree whose every level, except possibly the last, is filled, and all nodes in the last level are as far left as possible.
For example, if I had the following complete binary tree,
      _ 7_
     /    \
    4      9
   / \    / \
  2   6  8   10
 / \ /
1   3 5
the algorithm would return '3' since there are three leaves at the lowest level of the tree.
I've been able to find numerous solutions for finding the count of all the leaves in regular or balanced binary trees, but so far I haven't had any luck with the particular case of finding the count of the leaves at the lowest level of a complete binary tree. Any help would be appreciated.
Do a breadth-first search; that way you can also find the number of nodes on each level.
Some pseudo code:
q <- new queue of (node, level) data
add (root, 0) in q
nodesPerLevel <- new vector of integers
while q is not empty:
    (currentNode, currentLevel) <- take from top of q
    nodesPerLevel[currentLevel] += 1
    for each child in currentNode's children:
        add (child, currentLevel + 1) in q
return last value of nodesPerLevel
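In Python, a level-by-level version of this idea could look like the following sketch (the Node class is my own; the question doesn't fix a node type):
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def leaves_at_lowest_level(root):
    # walk the tree one level at a time; the size of the
    # last non-empty level is the answer
    count = 0
    level = [root] if root else []
    while level:
        count = len(level)
        level = [kid for node in level
                 for kid in (node.left, node.right) if kid]
    return count

# the example tree from the question
root = Node(7,
            Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5))),
            Node(9, Node(8), Node(10)))
print(leaves_at_lowest_level(root))  # -> 3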

How to adapt Fenwick tree to answer range minimum queries

A Fenwick tree is a data structure that gives an efficient way to answer two main queries:
add a value at a particular index of the array: update(index, value)
find the sum of the elements from 1 to N: find(n)
Both operations are done in O(log(n)) time, and I understand the logic and the implementation. It is not hard to implement a bunch of other operations, like finding the sum from N to M.
I wanted to understand how to adapt a Fenwick tree for RMQ (range minimum queries). It is obvious how to change the Fenwick tree for the first two operations, but I am failing to figure out how to find the minimum on the range from N to M.
After searching for solutions, the majority of people think that this is not possible, and a small minority claims that it actually can be done (approach1, approach2).
The first approach (written in Russian; based on my Google translation it has no explanation, only two functions) relies on three arrays (initial, left and right) and, in my testing, did not work correctly for all possible test cases.
The second approach requires only one array and, based on the claims, runs in O(log^2(n)); it also has close to no explanation of why and how it should work. I have not tried to test it.
In light of the controversial claims, I wanted to find out whether it is possible to augment a Fenwick tree to answer both update(index, value) and findMin(from, to).
If it is possible, I would be happy to hear how it works.
Yes, you can adapt Fenwick Trees (Binary Indexed Trees) to
Update value at a given index in O(log n)
Query minimum value for a range in O(log n) (amortized)
We need 2 Fenwick trees and an additional array holding the real values for nodes.
Suppose we have the following array:
index  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
value  1  0  2  1  1  3  0  4  2  5  2  2  3  1  0
We wave a magic wand and the following trees appear:
Note that in both trees each node represents the minimum value for all nodes within that subtree. For example, in BIT2 node 12 has value 0, which is the minimum value for nodes 12,13,14,15.
Queries
We can efficiently query the minimum value for any range by calculating the minimum of several subtree values and one additional real node value. For example, the minimum value for range [2,7] can be determined by taking the minimum value of BIT2_Node2 (representing nodes 2,3) and BIT1_Node7 (representing node 7), BIT1_Node6 (representing nodes 5,6) and REAL_4 - therefore covering all nodes in [2,7]. But how do we know which sub trees we want to look at?
Query(int a, int b) {
    int val = infinity // always holds the known min value for our range
    // Start traversing the first tree, BIT1, from the beginning of range, a
    int i = a
    while (parentOf(i, BIT1) <= b) {
        val = min(val, BIT2[i]) // Note: traversing BIT1, yet looking up values in BIT2
        i = parentOf(i, BIT1)
    }
    // Start traversing the second tree, BIT2, from the end of range, b
    i = b
    while (parentOf(i, BIT2) >= a) {
        val = min(val, BIT1[i]) // Note: traversing BIT2, yet looking up values in BIT1
        i = parentOf(i, BIT2)
    }
    val = min(val, REAL[i]) // Explained below
    return val
}
It can be mathematically proven that both traversals will end in the same node. That node is a part of our range, yet it is not a part of any subtrees we have looked at. Imagine a case where the (unique) smallest value of our range is in that special node. If we didn't look it up our algorithm would give incorrect results. This is why we have to do that one lookup into the real values array.
To help understand the algorithm I suggest you simulate it with pen & paper, looking up data in the example trees above. For example, a query for range [4,14] would return the minimum of values BIT2_4 (rep. 4,5,6,7), BIT1_14 (rep. 13,14), BIT1_12 (rep. 9,10,11,12) and REAL_8, therefore covering all possible values [4,14].
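The answer leaves parentOf unspecified; my reading, which reproduces both worked examples above, is the usual lowest-set-bit step, moving right in BIT1 and left in BIT2:
def lowbit(i):
    return i & -i  # lowest set bit of i

def parentOf(i, tree):
    # stepping i -> i + lowbit(i) in BIT1 skips exactly the block
    # [i, i + lowbit(i)), which is what BIT2[i] summarizes, and vice versa
    return i + lowbit(i) if tree == "BIT1" else i - lowbit(i)
For the [4,14] query above: from i=4, parentOf(4, BIT1) = 8 <= 14, so BIT2[4] is read and i becomes 8, where parentOf(8, BIT1) = 16 > 14 stops the loop; from i=14, the BIT2 steps read BIT1[14] and BIT1[12] and also stop at i=8, where REAL[8] supplies the final lookup.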
Updates
Since a node represents the minimum value of itself and its children, changing a node will affect its parents, but not its children. Therefore, to update a tree we start from the node we are modifying and move up all the way to the fictional root node (0 or N+1 depending on which tree).
Suppose we are updating some node in some tree:
If new value < old value, we will always overwrite the value and move up
If new value == old value, we can stop since there will be no more changes cascading upwards
If new value > old value, things get interesting.
If the old value still exists somewhere within that subtree, we are done
If not, we have to find the new minimum value between real[node] and each tree[child_of_node], change tree[node] and move up
Pseudocode for updating node with value v in a tree:
while (node <= n+1) {
    if (v > tree[node]) {
        if (oldValue == tree[node]) {
            v = min(v, real[node])
            for-each child {
                v = min(v, tree[child])
            }
        } else break
    }
    if (v == tree[node]) break
    tree[node] = v
    node = parentOf(node, tree)
}
Note that oldValue is the original value we replaced, whereas v may be reassigned multiple times as we move up the tree.
Binary Indexing
In my experiments, range minimum queries were about twice as fast as a segment tree implementation, and updates were marginally faster. The main reason for this is using super-efficient bitwise operations for moving between nodes. They are very well explained here. Segment trees are really simple to code, though, so think about whether the performance advantage is really worth it. The update method of my Fenwick RMQ is 40 lines and took a while to debug. If anyone wants my code I can put it on GitHub. I also produced a brute-force checker and test generators to make sure everything works.
I had help understanding this subject & implementing it from the Finnish algorithm community. Source of the image is http://ioinformatics.org/oi/pdf/v9_2015_39_44.pdf, but they credit Fenwick's 1994 paper for it.
The Fenwick tree structure works for addition because addition is invertible. It doesn't work for minimum, because as soon as you have a cell that's supposed to be the minimum of two or more inputs, you've potentially lost information.
If you're willing to double your storage requirements, you can support RMQ with a segment tree that is constructed implicitly, like a binary heap. For an RMQ with n values, store the n values at locations [n, 2n) of an array. Locations [1, n) are aggregates, with the formula A(k) = min(A(2k), A(2k+1)). Location 2n is an infinite sentinel. The update routine should look something like this.
def update(n, a, i, x):  # value[i] = x
    i += n
    a[i] = x
    # update the aggregates
    while i > 1:
        i //= 2
        a[i] = min(a[2*i], a[2*i+1])
The multiplies and divides here can be replaced by shifts for efficiency.
The RMQ pseudocode is more delicate. Here's another untested and unoptimized routine.
def rmq(n, a, i, j):  # min(value[i:j])
    i += n
    j += n
    x = float('inf')
    while i < j:
        if i % 2 == 0:
            i //= 2
        else:
            x = min(x, a[i])
            i = i//2 + 1
        if j % 2 == 0:
            j //= 2
        else:
            x = min(x, a[j-1])
            j //= 2
    return x
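As a quick check of the two routines (build is my helper, not part of the answer; index 2n holds the infinite sentinel mentioned above):
inf = float("inf")

def build(values):
    n = len(values)
    a = [inf] * (2 * n + 1)        # index 2n is the sentinel
    a[n:2 * n] = values            # leaves live at [n, 2n)
    for k in range(n - 1, 0, -1):  # fill the aggregates bottom-up
        a[k] = min(a[2 * k], a[2 * k + 1])
    return a

values = [1, 0, 2, 1, 1, 3, 0, 4]
n = len(values)
a = build(values)
print(rmq(n, a, 2, 7))  # min(values[2:7]) -> 0
update(n, a, 6, 5)      # value[6] = 5
print(rmq(n, a, 2, 7))  # -> 1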

In what order should you insert a set of known keys into a B-Tree to get minimal height?

Given a fixed set of keys or values (stored either in an array or in some data structure) and the order of a B-tree, can we determine the sequence of inserting keys that would generate a space-efficient B-tree?
To illustrate, consider a B-tree of order 3. Let the keys be {1,2,3,4,5,6,7}. Inserting the elements into the tree in the following order
for (int i = 1; i < 8; ++i)
{
    tree.push(i);
}
would create a tree like this:
      4
   2     6
  1 3   5 7
see http://en.wikipedia.org/wiki/B-tree
But inserting elements in this way
flag = true;
for (int i = 1, j = 7; i < 8; ++i, --j)
{
    if (flag)
    {
        tree.push(i);
        flag = false;
    }
    else
    {
        tree.push(j);
        flag = true;
    }
}
creates a tree like this:
    3 5
 1 2  4  6 7
where we can see there is a decrease in levels.
So is there a particular way to determine sequence of insertion which would reduce space consumption?
The following trick should work for most ordered search trees, assuming the data to insert are the integers 1..n.
Consider the binary representation of your integer keys - for 1..7 (with dots for zeros) that's...
Bit : 210

  1 : ..1
  2 : .1.
  3 : .11
  4 : 1..
  5 : 1.1
  6 : 11.
  7 : 111
Bit 2 changes least often, Bit 0 changes most often. That's the opposite of what we want, so what if we reverse the order of those bits, then sort our keys in order of this bit-reversed value...
Bit : 210      Rev
  4 : 1.. -> ..1 : 1
  ------------------
  2 : .1. -> .1. : 2
  6 : 11. -> .11 : 3
  ------------------
  1 : ..1 -> 1.. : 4
  5 : 1.1 -> 1.1 : 5
  3 : .11 -> 11. : 6
  7 : 111 -> 111 : 7
It's easiest to explain this in terms of an unbalanced binary search tree, growing by adding leaves. The first item is dead centre - it's exactly the item we want for the root. Then we add the keys for the next layer down. Finally, we add the leaf layer. At every step, the tree is as balanced as it can be, so even if you happen to be building an AVL or red-black balanced tree, the rebalancing logic should never be invoked.
[EDIT I just realised you don't need to sort the data based on those bit-reversed values in order to access the keys in that order. The trick is to notice that bit-reversing is its own inverse. As well as mapping keys to positions, it maps positions to keys. So if you loop through from 1..n, you can use this bit-reversed value to decide which item to insert next - for the first insert use the 4th item, for the second insert use the second item and so on. One complication - you have to round n upwards to one less than a power of two (7 is OK, but use 15 instead of 8) and you have to bounds-check the bit-reversed values. The reason is that bit-reversing can move some in-bounds positions out-of-bounds and vice versa.]
Actually, for a red-black tree some rebalancing logic will be invoked, but it should just be re-colouring nodes - not rearranging them. However, I haven't double checked, so don't rely on this claim.
For a B tree, the height of the tree grows by adding a new root. Proving this works is, therefore, a little awkward (and it may require a more careful node-splitting than a B tree normally requires) but the basic idea is the same. Although rebalancing occurs, it occurs in a balanced way because of the order of inserts.
This can be generalised for any set of known-in-advance keys because, once the keys are sorted, you can assign suitable indexes based on that sorted order.
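A small sketch of this ordering in Python (my illustration, not the answer's code):
def bit_reverse(x, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def insertion_order(n):
    # round n up to one less than a power of two and bounds-check,
    # as described in the edit above
    bits = n.bit_length()
    full = (1 << bits) - 1
    return [k for k in (bit_reverse(i, bits) for i in range(1, full + 1))
            if 1 <= k <= n]

print(insertion_order(7))  # -> [4, 2, 6, 1, 5, 3, 7]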
WARNING - This isn't an efficient way to construct a perfectly balanced tree from known already-sorted data.
If you have your data already sorted, and know it's size, you can build a perfectly balanced tree in O(n) time. Here's some pseudocode...
if size is zero, return null
from the size, decide which index should be the (subtree) root
recurse for the left subtree, giving that index as the size (assuming 0 is a valid index)
take the next item to build the (subtree) root
recurse for the right subtree, giving (size - (index + 1)) as the size
add the left and right subtree results as the child pointers
return the new (subtree) root
Basically, this decides the structure of the tree based on the size and traverses that structure, building the actual nodes along the way. It shouldn't be too hard to adapt it for B Trees.
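A Python rendering of that pseudocode, as a sketch (taking the middle position as the subtree root):
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def build_balanced(sorted_items):
    it = iter(sorted_items)
    def rec(size):
        if size == 0:
            return None
        index = size // 2          # which position becomes the subtree root
        left = rec(index)          # the left subtree consumes `index` items
        root = Node(next(it))      # the next item becomes the subtree root
        root.left = left
        root.right = rec(size - (index + 1))
        return root
    return rec(len(sorted_items))

tree = build_balanced([1, 2, 3, 4, 5, 6, 7])
print(tree.key, tree.left.key, tree.right.key)  # -> 4 2 6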
This is how I would add elements to a B-tree. Thanks to Steve314 for giving me the start with the binary representation.
Given are n elements to add, in order. We have to add them to an order-m B-tree. Take their indexes (1...n) and convert them to radix m. The main idea of this insertion is to insert the number with the highest m-radix bit first, and to keep it above the lesser m-radix numbers added to the tree despite the splitting of nodes.
1,2,3... are indexes, so you actually insert the numbers they point to.
For example, in an order-4 tree:
       4          8           12         <- highest radix bit numbers
 1,2,3    5,6,7      9,10,11      13,14,15
Now, depending on the order, the median can be:
order is even -> number of keys is odd -> median is the middle one (mid median)
order is odd -> number of keys is even -> left median or right median
The choice of median (left/right) to be promoted will decide the order in which I should insert elements. This has to be fixed for the B-tree.
I add elements to the tree in buckets. First I add one bucket's elements, then, on completion, the next bucket in order. Buckets can easily be created if the median is known; the bucket size is the order m.
I take the left median for promotion. Choosing the bucket for insertion:
    |   4   |    8    |    12    |
 1,2,|3   5,6,|7   9,10,|11   13,14,|15
        3         2          1        <- order to insert buckets
For the left-median choice I insert buckets into the tree starting from the right side; for the right-median choice I insert buckets from the left side. Choosing the left median, we insert the median first, then the elements to its left, then the rest of the numbers in the bucket.
Example
Bucket median first:
12,
Add elements to the left:
11,12,
Then, after all its elements are inserted, it looks like:
|     12     |
|11    13,14,|
Then I choose the bucket to the left of it and repeat the same process.
Median:
      12
8,11     13,14,
Add elements to the left first:
      12
7,8,11     13,14,
Adding the rest:
    8   |   12
7    9,10,|11    13,14,
Similarly, keep adding all the numbers:
   4   |   8   |   12
3   5,6,|7   9,10,|11    13,14,
At the end, add the numbers left out of the buckets:
    |   4   |    8    |    12    |
 1,2,|3   5,6,|7   9,10,|11   13,14,|15
For the mid median (even-order B-trees) you simply insert the median and then all the numbers in the bucket.
For the right median I add buckets from the left. For elements within a bucket I first insert the median, then the elements to its right, and then the elements to its left.
Here we are adding the highest m-radix numbers, and in the process I added the numbers with the immediately lesser m-radix bit, making sure the highest m-radix numbers stay at the top. Here I have only two levels; for more levels I repeat the same process in descending order of radix bits.
The last case is when the remaining elements all have the same radix bit and there are no numbers with a lesser radix bit; then simply insert them and finish the procedure.
I would give an example for 3 levels, but it is too long to show. So please try it with other parameters and tell me if it works.
Unfortunately, trees exhibit their worst-case running times, and require rigid balancing techniques, when data is entered in increasing order like that. Binary trees quickly turn into linked lists, etc.
For typical B-Tree use cases (databases, filesystems, etc), you can typically count on your data naturally being more distributed, producing a tree more like your second example.
Though if it is really a concern, you could hash each key, guaranteeing a wider distribution of values.
for (i = 1; i < 8; ++i)
    tree.push(hash(i));
To build a particular B-tree using Insert() as a black box, work backward. Given a nonempty B-tree, find a node with more than the minimum number of children that's as close to the leaves as possible. The root is considered to have minimum 0, so a node with the minimum number of children always exists. Delete a value from this node to be prepended to the list of Insert() calls. Work toward the leaves, merging subtrees.
For example, given the 2-3 tree
        8
    4       c
  2   6   a   e
 1 3 5 7 9 b d f,
we choose 8 and do merges to obtain the predecessor
    4       c
  2   6   a   e
 1 3 5 79 b d f.
Then we choose 9.
    4       c
  2   6   a   e
 1 3 5 7 b d f
Then a.
    4     c
  2   6     e
 1 3 5 7b d f
Then b.
    4     c
  2   6     e
 1 3 5 7 d f
Then c.
    4
  2   6   e
 1 3 5 7d f
Et cetera.
So is there a particular way to determine sequence of insertion which would reduce space consumption?
Edit note: since the question was quite interesting, I have tried to improve my answer with a bit of Haskell.
Let k be the Knuth order of the B-tree and list be a list of keys.
The minimization of space consumption has a trivial solution:
-- won't use point free notation to ease haskell newbies
trivial k list = concat $ reverse $ chunksOf (k-1) $ sort list
Such an algorithm will efficiently produce a time-inefficient B-tree, unbalanced on the left but with minimal space consumption.
A lot of non-trivial solutions exist that are less efficient to produce but show better lookup performance (lower height/depth). As you know, it's all about trade-offs!
A simple algorithm that minimizes both the B-tree depth and the space consumption (but doesn't minimize lookup performance!) is the following:
-- Sort the list in increasing order and call sortByBTreeSpaceConsumption
-- with the result
smart k list = sortByBTreeSpaceConsumption k $ sort list
-- Sort list so that inserting in a B-Tree with Knuth order = k
-- will produce a B-Tree with minimal space consumption and minimal depth
-- (but not best performance)
sortByBTreeSpaceConsumption :: Ord a => Int -> [a] -> [a]
sortByBTreeSpaceConsumption _ [] = []
sortByBTreeSpaceConsumption k list
  | k - 1 >= numOfItems = list -- this will be a leaf
  | otherwise = heads ++ tails ++ sortByBTreeSpaceConsumption k remainder
  where requiredLayers = minNumberOfLayersToArrange k list
        numOfItems = length list
        capacityOfInnerLayers = capacityOfBTree k $ requiredLayers - 1
        blockSize = capacityOfInnerLayers + 1
        blocks = chunksOf blockSize balanced
        heads = map last blocks
        tails = concat $ map (sortByBTreeSpaceConsumption k . init) blocks
        balanced = take (numOfItems - (mod numOfItems blockSize)) list
        remainder = drop (numOfItems - (mod numOfItems blockSize)) list
-- Capacity of a layer n in a B-Tree with Knuth order = k
layerCapacity k 0 = k - 1
layerCapacity k n = k * layerCapacity k (n - 1)

-- Infinite list of capacities of layers in a B-Tree with Knuth order = k
capacitiesOfLayers k = map (layerCapacity k) [0..]

-- Capacity of a B-Tree with Knuth order = k and l layers
capacityOfBTree k l = sum $ take l $ capacitiesOfLayers k

-- Infinite list of capacities of B-Trees with Knuth order = k
-- as the number of layers increases
capacitiesOfBTree k = map (capacityOfBTree k) [1..]

-- Compute the minimum number of layers in a B-Tree of Knuth order k
-- required to store the items in list
minNumberOfLayersToArrange k list = 1 + f k
  where numOfItems = length list
        f = length . takeWhile (< numOfItems) . capacitiesOfBTree
With this smart function, given a list = [21, 18, 16, 9, 12, 7, 6, 5, 1, 2] and a B-tree with Knuth order = 3, we should obtain [18, 5, 9, 1, 2, 6, 7, 12, 16, 21], with a resulting B-tree like:
      [18, 21]
     /
  [5, 9]
  /   |   \
[1,2] [6,7] [12, 16]
Obviously this is suboptimal from a performance point of view, but it should be acceptable, since obtaining a better one (like the following) would be far more expensive (computationally and economically):
     [7, 16]
    /   |   \
 [5,6] [9,12] [18, 21]
  /
[1,2]
If you want to run it, put the previous code in a Main.hs file and compile it with ghc after prepending:
import Data.List (sort)
import Data.List.Split
import System.Environment (getArgs)
main = do
  args <- getArgs
  let knuthOrder = read $ head args
  let keys = (map read $ tail args) :: [Int]
  putStr "smart: "
  putStrLn $ show $ smart knuthOrder keys
  putStr "trivial: "
  putStrLn $ show $ trivial knuthOrder keys

Binary Tree represented using array

Consider the following array, which is claimed to represent a binary tree:
[1, 2, 5, 6, -1, 8, 11]
Given that the index with value -1 indicates the root element, I have the following questions:
a) How is this actually represented?
Should we follow the formulae below (source: this link) to figure out the tree?
Three simple formulae allow you to go from the index of the parent to the index of its children and vice versa:
* if index(parent) = N, index(left child) = 2*N+1
* if index(parent) = N, index(right child) = 2*N+2
* if index(child) = N, index(parent) = (N-1)/2 (integer division with truncation)
If we use the above formulae, then index(root) = 4 and index(left child) = 2*4+1 = 9, which doesn't exist.
b) Is it important to know whether it's a complete binary tree or not?
N=0 must be the root node since by the rules listed, it has no parent. 0 cannot be created from either of the expressions (2*N + 1) or (2*N + 2), assuming no negative N.
Note, index is not the value stored in the array, it is the place in the array.
For [1, 2, 5, 6, -1, 8, 11]
Index 0 = 1
Index 1 = 2
Index 2 = 5, etc.
If it is a complete tree, then -1 is a valid value and the tree is:
      1
    /   \
   2     5
  / \   / \
 6  -1 8  11
-1 could also be a "NULL" pointer, indicating that no value exists at that node. So the tree would look like:
     1
    / \
   2   5
  /   / \
 6   8   11
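A tiny sketch of how the heap-style formulae play out on this array under that second interpretation (my illustration):
a = [1, 2, 5, 6, -1, 8, 11]

def show(n, depth=0):
    # print the subtree rooted at index n, treating -1 as "no node"
    if n >= len(a) or a[n] == -1:
        return
    print("  " * depth + str(a[n]))
    show(2 * n + 1, depth + 1)  # left child
    show(2 * n + 2, depth + 1)  # right child

show(0)  # 1 at the root, with subtrees 2 -> 6 and 5 -> 8, 11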
Given an array, you could think of any number of ways that array could represent a binary tree, so there is no way to know; you have to go to the source of that array (whatever that is).
One of those ways is the way binary heap is usually represented, as per your link. If this was the representation used, -1 would not be the root element. And the node at position 3 would have no children, i.e. it would be a leaf.
And, yeah, it's probably important to know whether it's supposed to be a complete tree or not.
In general, you shouldn't try to figure out what does some data mean like this. You should be given documentation or the source code that uses the data. If you don't have that and you really need to reverse-engineer it, you most likely need to know more about the data. Observing the behavior of the code that uses it should help you. Or decompiling the code.
It may not be a complete binary tree, but it may not be an arbitrary one either. You can represent a tree in which only some of the rightmost leaves are missing (or, if you exchange the convention for left and right children, only some of the leftmost leaves).
You can't represent this in your array:
    A
   / \
  B   C
 /   /
D   E
But you can represent this
    A
   / \
  B   C
 / \
D   E
or this:
    A
   / \
  B   C
     / \
    D   E
(for the last, have 2k+1 be the right child and 2k+2 the left child)
You only need to know the number of nodes in the tree.
