I'm reading the paper "Making B+-Trees Cache Conscious in Main Memory". In Section 3.1.2, the authors describe several approaches to searching within a CSB+-tree node.
The basic approach is to simply do a binary search using a conventional while loop.
The uniform approach uses code expansion, unfolding the while loop into if-then-else statements on the assumption that all the keys are used.
The authors give the following example, which shows an unfolding of the search for a node with up to 9 keys. The number in a node represents the position of the key being used in an if test:
        4
      /   \
     2     6
    / \   / \
   1   3 5   8
            / \
           7   9
Then comes the confusing part:
If only 5 keys were actually present, we could traverse this tree with exactly 3 comparisons. On the other hand, an unfolding that put the deepest subtree at the left instead of the right would need 4 comparisons on some branches.
So why would it need more comparisons in the following tree:
          6
        /   \
       4     8
      / \   / \
     2   5 7   9
    / \
   1   3
Furthermore,
if we knew we had only five valid keys, we could hardcode a tree that, on average, used 2.67 comparisons rather than 3.
How does 2.67 come about?
Any hints would be appreciated. Also, a link directing me to code expansion knowledge would be helpful.
Actually, I'm not sure whether it's appropriate to ask a question on a paper because some key information may have been left out when transcribed here (the question may need reformatting). I just wish there could be someone who happens to have read the paper.
Thanks
The key point here is the following quotation from the same section:
we pad all the unused keys (keyList[nKeys..2d-1]) in a node with the largest possible key
It is also important that in a B+/CSB+ tree we search not for the node values themselves but for the intervals between them. Five keys split the set of possible values into 6 intervals.
Since most of the right subtree is filled with the largest possible key (L), we need fewer comparisons than usual:
        4
      /   \
     2     L
    / \   / \
   1   3 5   L
            / \
           L   L
The right child of the root holds the largest possible key, so there is no need to check any node to its right. Exactly 3 comparisons are needed for every interval:
interval   comparisons
up to 1    k>4, k>2, k>1
1..2       k>4, k>2, k>1
2..3       k>4, k>2, k>3
3..4       k>4, k>2, k>3
4..5       k>4, k>L, k>5
5..L       k>4, k>L, k>5
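The unfolded search with padding can be sketched like this. It is only an illustration, not the paper's code: the function name `search_unfolded_9`, the use of `math.inf` as the "largest possible key", and the comparison counter are my assumptions. It returns the number of keys strictly less than `k`, which is the interval (child) index.

```python
import math

def search_unfolded_9(keys, k):
    """Search a node with 9 key slots using hard-coded if-then-else tests.

    keys must have length 9; unused slots hold math.inf (the padding).
    Returns (interval_index, number_of_comparisons).
    """
    count = 0

    def gt(pos):
        # pos is the 1-based key position shown in the diagram
        nonlocal count
        count += 1
        return k > keys[pos - 1]

    if gt(4):
        if gt(6):
            if gt(8):
                idx = 9 if gt(9) else 8
            else:
                idx = 7 if gt(7) else 6
        else:
            idx = 5 if gt(5) else 4
    else:
        if gt(2):
            idx = 3 if gt(3) else 2
        else:
            idx = 1 if gt(1) else 0
    return idx, count

# Five valid keys (made-up values), four padded slots.
node = [10, 20, 30, 40, 50] + [math.inf] * 4
```

With this padding, `k > math.inf` is never true, so every lookup takes the short branch under position 6 and costs exactly 3 comparisons.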
If we put the deepest subtree at the left, we get a tree in which the non-empty nodes are placed one level deeper:
          L
        /   \
       4     L
      / \   / \
     2   5 L   L
    / \
   1   3
Searching for node "1" in this tree requires comparing the key with 4 different nodes: L, 4, 2, and 1.
If we know we have only five valid keys, we have the following tree:
     2
    / \
   1   4
      / \
     3   5
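Here the comparisons can be arranged so that two of the six intervals need 2 comparisons and the other four need 3, which gives (2*2 + 4*3)/6 = 16/6, or about 2.67 comparisons on average: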
interval   comparisons
up to 1    k>2, k>1
1..2       k>2, k>1
2..3       k>2, k>4, k>3
3..4       k>2, k>4, k>3
4..5       k>2, k>4, k>5
5..L       k>2, k>4, k>5
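To check the average, here is a small sketch of the hard-coded five-key tree with a comparison counter. The name `search_5`, the sample key values, and the assumption that all six intervals are equally likely are mine, not the paper's.

```python
def search_5(keys, k):
    """Hard-coded search for a node that is known to hold exactly 5 keys.

    Returns (interval_index, number_of_comparisons).
    """
    count = 0

    def gt(pos):
        # pos is the 1-based key position shown in the diagram
        nonlocal count
        count += 1
        return k > keys[pos - 1]

    if gt(2):
        if gt(4):
            idx = 5 if gt(5) else 4
        else:
            idx = 3 if gt(3) else 2
    else:
        idx = 1 if gt(1) else 0
    return idx, count

keys = [10, 20, 30, 40, 50]
probes = [5, 15, 25, 35, 45, 55]   # one probe per interval
costs = [search_5(keys, k)[1] for k in probes]
average = sum(costs) / len(costs)  # 16 / 6
```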
"Code expansion" is not a widely used term, so I cannot give you the most relevant link. I think it is not very different from "loop unwinding" (also called loop unrolling).
Given a tree, find the subtrees that occur more than once and compact the tree by replacing each repeated subtree with a reference to a single shared copy.
e.g.
       1
     /   \
    2     3
   / |   / \
  4  5  4   5
should be converted to
       1
     /   \
    2     3
   / |   / \
  4  5   |  |
  ^  ^   |  |
  |__|___|  |
     |______|
This was asked in my interview. The approach I shared was not optimal, O(n^2). I would be grateful if someone could help with a solution or point me to a similar problem; I couldn't find any. Thanks!
Edit: a more complex example:
       1
     /   \
    2     3
   / |   / \
  4  5  2   7
       / \
      4   5
The whole subtree rooted at 2 should be replaced:
       1
     /   \
    2 <-- 3
   / |     \
  4  5      7
You can do this in a single DFS traversal using a hash map from (value, left_pointer, right_pointer) -> node to collapse repeated occurrences of the subtree.
As you leave each node in your DFS, you just look it up in the map. If a matching node already exists, then replace it with the pre-existing one. Otherwise, add it to the map.
This takes O(n) time, because you are comparing the actual pointers to the left + right subtrees, instead of traversing the trees to compare them. The pointer comparison gives the same result, because the left and right subtrees have already been canonicalized.
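A minimal sketch of that idea, assuming a binary tree and a hypothetical `Node` class with `value`, `left` and `right` fields. Python's `id()` stands in for the pointer comparison; that is safe here because every node whose id is used as part of a key is kept alive by the map.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def compact(root):
    """Share identical subtrees; returns the root of the resulting DAG."""
    canonical = {}  # (value, id(left), id(right)) -> canonical node

    def visit(node):
        if node is None:
            return None
        # Canonicalize the children first (post-order), so that equal
        # subtrees are already represented by the very same object.
        node.left = visit(node.left)
        node.right = visit(node.right)
        key = (node.value, id(node.left), id(node.right))
        # Reuse an existing node with the same key, else register this one.
        return canonical.setdefault(key, node)

    return visit(root)

# The "more complex" example from the question.
root = Node(1,
            Node(2, Node(4), Node(5)),
            Node(3, Node(2, Node(4), Node(5)), Node(7)))
root = compact(root)
```

After the call, `root.left` and `root.right.left` are the same object. The recursion is the simplest way to write the DFS; for very deep trees an explicit stack would avoid Python's recursion limit.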
First, we need to store the node values that have already appeared in a hash table. If the tree already exists, we can traverse it and, whenever a node's value is already in the table, delete the branches of that node; otherwise we add the value to the table. If the tree is still being built, we check whether the value appears in the table each time a new node is made.
I wonder whether there is any relation between an element's position in the data array and its position in the tree array.
int data[N];
int tree[M]; // let M = 2^X - 1, where X = nearest ceiling power of 2 to N
void build_segment_tree();
Can I say that the n'th value of data[] is mapped to the i'th value of tree[]? Is there a mathematical formula for it?
You certainly can. A segment tree is used for its capability to store information about segments.
If you build a segment tree out of N elements, you need ceil(log_2(N))+1 levels, and the last level holds all the ranges of length 1, i.e. the single elements.
These elements sit precisely at positions (1-indexed) 2^ceil(log_2(N)) to 2^ceil(log_2(N))+N-1.
                [1-8]
              /       \
        [1-4]           [5-8]
        /   \           /   \
    [1-2]  [3-4]    [5-6]  [7-8]
     / \    / \      / \    / \
   [1][2] [3][4]   [5][6] [7][8]
                 1-11
              /        \
        1-6              7-11
    1-3     4-6      7-9     10-11
  1-2  3  4-5  6   7-8  9   10   11
  1 2     4 5      7 8
This answer is only valid for a segment tree built over a power-of-2 number of elements, as in the first tree above.
For other N, as in the second tree (N = 11), the single elements end up on different levels and are not stored contiguously.
So the formula is wrong when N is not a power of 2, and in that case there is no comparably simple rule.
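For the power-of-2 case, the mapping can be sketched as follows. This is only an illustration: it assumes the usual heap layout (root at index 1, children of node i at 2i and 2i+1) and a sum as the stored segment information, and `leaf_position` is a hypothetical helper name. None of that is fixed by the question.

```python
def build_segment_tree(data):
    """Build a sum segment tree in heap layout (root at index 1).

    This sketch only handles lengths that are a power of 2.
    """
    n = len(data)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes a power-of-2 length")
    tree = [0] * (2 * n)            # index 0 is unused
    for i, value in enumerate(data):
        tree[n + i] = value         # leaves fill positions n .. 2n-1
    for node in range(n - 1, 0, -1):
        tree[node] = tree[2 * node] + tree[2 * node + 1]
    return tree

def leaf_position(n, i):
    """Position in tree[] of data[i] (0-based i) for a power-of-2 n."""
    return n + i

data = [3, 1, 4, 1, 5, 9, 2, 6]
tree = build_segment_tree(data)
```

For N = 8 the leaves land at positions 8 to 15, which matches 2^ceil(log_2(N)) to 2^ceil(log_2(N))+N-1.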
If I delete node x and then node y, or delete y and then x, will I end up with the same binary search tree after both deletions?
I tried a few examples and I think that's true.
But how can I prove this?
It's false. Consider the tree
    4
   / \
  1   5
   \
    2
     \
      3 ,
from which 4 and 5 are deleted in some order. If the order is 5 then 4, the result is
  1
   \
    2
     \
      3 .
If the order is 4 then 5, the result could be
    3
   /
  1
   \
    2 ,
assuming that, when we would delete a node with two children, we instead replace its value by that of its in-order predecessor and delete the predecessor. (I'm assuming also the standard deletion procedure for zero- and one-child nodes.)
Although I found this example by hand, I often turn to computer assistance.
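For instance, a short script along these lines reproduces the counterexample. It is a sketch under the deletion rule assumed above (in-order predecessor for two-child nodes); the `Node` class and the helper names are made up for the illustration.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def delete(root, value):
    if root is None:
        return None
    if value < root.value:
        root.left = delete(root.left, value)
    elif value > root.value:
        root.right = delete(root.right, value)
    elif root.left is None:          # zero or one child
        return root.right
    elif root.right is None:
        return root.left
    else:                            # two children: use the predecessor
        pred = root.left
        while pred.right is not None:
            pred = pred.right
        root.value = pred.value
        root.left = delete(root.left, pred.value)
    return root

def shape(root):
    """Nested tuples (value, left, right) describing the tree."""
    if root is None:
        return None
    return (root.value, shape(root.left), shape(root.right))

def build(values):
    root = None
    for v in values:
        root = insert(root, v)
    return root

order = [4, 1, 5, 2, 3]              # produces the tree drawn above
five_then_four = shape(delete(delete(build(order), 5), 4))
four_then_five = shape(delete(delete(build(order), 4), 5))
```

The two shapes differ, so the deletions do not commute under this rule. Looping over random insertion orders and pairs of keys with the same helpers is how I would search for such examples automatically.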
The question may look very simple, and probably the answer is too, but I always get confused by tree questions.
OK, so I want to make a tree something like:
        3           level 0
       / \
      4   5         level 1 ..
     / \ / \
    6   7   8
   / \ / \ / \
  9   10  11  12
What are such trees called? Sorry, I'm a beginner.
The function can take an array[] of ints, or it can read input up to N = 3 (denoting level 3, with 10 nodes). Also, can you give a solution in C/C++/Java?
Given that your requirements are only for traversal, I would simply implement this using an array a that contains each level as a contiguous sub-array. Numbering levels from 1 (so that level i holds i values) and positions within a level from 0, level i occupies entries L(i-1) up to but not including L(i), where L(n) = n*(n+1)/2. In particular, the jth value on the ith level is in a[L(i-1)+j].
As long as you always keep track of i and j, you can now easily navigate through your pyramid.
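A rough sketch of that indexing, in Python rather than the requested C/C++/Java so it stays short (the arithmetic carries over unchanged). Levels are counted from 1 and positions from 0. The helper names are made up, and the `children` rule (a node at (i, j) has children (i+1, j) and (i+1, j+1)) is read off the diagram in the question.

```python
def tri(n):
    """L(n) = n*(n+1)/2, the number of entries in the first n levels."""
    return n * (n + 1) // 2

def index(i, j):
    """Array index of the j-th value (0-based) on level i (1-based)."""
    return tri(i - 1) + j

def children(i, j):
    """Coordinates of the two nodes below (i, j) in the pyramid."""
    return (i + 1, j), (i + 1, j + 1)

# The pyramid from the question, stored level by level.
a = [3,
     4, 5,
     6, 7, 8,
     9, 10, 11, 12]

left, right = children(2, 0)          # the two nodes below the value 4
below_four = (a[index(*left)], a[index(*right)])
```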
I wanted to draw a balanced binary search tree for numbers from 1 to 20.
            _______10_______
           /                \
       ___5___               15
      /       \             /   \
     3         8          13     18
    / \       / \        /  \   /  \
   2   4     7   9     12   14 17   19
  /         /          /       /
 1         6         11      16
Is the above tree correct and balanced?
In answer to your original question as to whether or not you need to first calculate the height, no, you don't need to. You just have to understand that a balanced tree is one where the height difference between the tallest and shortest node is zero or one, and the simplest way to achieve this is to ensure that you always pick the midpoint of the possible list, when populating the top node in a sub-tree.
Your sample tree is balanced since all leaf nodes are either at the bottom or next-to-bottom level, hence the difference in heights between any two leaf nodes is at most one.
To create a balanced tree from the numbers 1 through 20 inclusive, you can just make the root entry 10 or 11 (the midpoint being 10.5 for those numbers), so that the two sub-trees hold as close to the same quantity of numbers as possible.
Then just do that recursively for each sub-tree. On the lower side of 10, 5 is the midpoint:
              10
             /  \
            5    11-thru-19 sub-tree
           / \
   1-thru-4   6-thru-9
   sub-tree   sub-tree
Just expand on that and you'll end up with something like:
            _______10_______
           /                \
       ___5___               15
      /       \             /   \
     2         7          13     17
    / \       / \        /      /  \
   1   3     6   8     11     16    18   <- depth of highest leaf node
        \         \      \            \
         4         9      12           19  <- depth of lowest leaf node
                                            ^
                                            |
                                     Difference is 1
The midpoint is the number for which the difference between the quantities above and below it is one or zero. For the whole list of numbers 1 through 20 inclusive, there are nine less than 10 and ten greater than 10 (or, if you chose 11 as the midpoint, the quantities are ten and nine).
The difference between your sample and mine is probably to do with the fact that I preferred to pick the midpoint by rounding down where there was a choice (meaning my right sub-trees tend to be "heavier"). Because your left sub-trees are heavier, you appear to have rounded up.
After choosing 10 as the initial midpoint, there's no leeway on the left sub-tree: you have to choose 5, since it has four numbers above it and four below it. Any other midpoint would result in a difference of at least two between the two halves (for example, choosing 4 as the midpoint would give halves of size three and five). This can still give you a balanced sub-tree depending on the data, but it's "safer" to choose the midpoint.
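The recursive midpoint rule can be sketched in a few lines. This is an illustration only: it represents each node as a `(value, left, right)` tuple, always rounds the midpoint down, and the function names are mine. The lower levels it produces need not match either diagram above, since those depend on the rounding choices.

```python
def build_balanced(lo, hi):
    """Balanced BST over the integers lo..hi as nested (value, left, right)."""
    if lo > hi:
        return None
    mid = (lo + hi) // 2              # round down when there is a choice
    return (mid, build_balanced(lo, mid - 1), build_balanced(mid + 1, hi))

def leaf_depths(tree, depth=0):
    """Depths of all leaf nodes, with the root at depth 0."""
    if tree is None:
        return []
    _, left, right = tree
    if left is None and right is None:
        return [depth]
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)

def in_order(tree):
    if tree is None:
        return []
    value, left, right = tree
    return in_order(left) + [value] + in_order(right)

tree = build_balanced(1, 20)
depths = leaf_depths(tree)
```

For 1 through 20 this picks 10 as the root and 5 as its left child, and the deepest and shallowest leaves differ by one level.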