How to do B-Tree Insert - data-structures

I am trying to insert 3 values into this B-Tree, 60, 61, and 62. I understand how to insert values when a node is full, and has an empty parent, but what if the parent is full?
For example, when I insert 60 and 61, that node will now be full. I can't extend the parent, or the parent of the parent (because they are full). So can I change the values of the parent? I have provided an image of the B-tree prior to my insert, and after.
Attempt to insert 60, 61, 62:
Notice I changed the 66 in the root to 62, and added 62 to the <72 node. Is this the correct way to do this?

With the insertion you've done, you get what's normally referred to as a B* tree. In a "pure" B-tree, the insertion when the root is full would require splitting the current root into two nodes, and creating a new root node above them (B-tree implementations do not require that root node to follow the same rule as other nodes for the minimum number of descendants, so having only two would be allowed).
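A minimal sketch of the root split described above, assuming a hypothetical `Node` class and `split_root` helper (neither comes from the question, and the keys used are made up because the original image is not available):

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty list means leaf


def split_root(root):
    """Split an overfull root around its median key and return the new root.

    The new root holds a single key and exactly two children, which is
    allowed because the root is exempt from the minimum-occupancy rule.
    """
    mid = len(root.keys) // 2
    left = Node(root.keys[:mid], root.children[:mid + 1])
    right = Node(root.keys[mid + 1:], root.children[mid + 1:])
    return Node([root.keys[mid]], [left, right])


# Hypothetical overfull leaf root (one key more than it may hold).
new_root = split_root(Node([10, 20, 30, 40, 50]))
print(new_root.keys, new_root.children[0].keys, new_root.children[1].keys)
```

The tree grows in height only through this step, which is why all leaves stay at the same depth.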

Related

How to estimate the number of nodes between the current node and some other node in Kademlia?

In Kademlia, all (key, value) pairs stored by a node, aside from the ones originally published by the node itself, have an expiration time based on where the node is located in relation to the key. If the current node is one of the k closest nodes to the key, the (key, value) pair expires 24 hours after it was originally published. If it's not one of the k closest nodes, the expiration time is
inversely proportional to the number of nodes between the current node and the node whose ID is closest to the key ID
as per Kademlia paper. The paper also says that
this number can be inferred from the bucket structure of the current node.
There seem to be two very different ways to count nodes between the current node and the given node, and I'm not sure which one is right. From here on, assume a flat-array routing table implementation: an array with 160 pre-allocated buckets.
The Xlattice Kademlia Design Specification page says that you should find the bucket index j into which the given node would fall and count the nodes in buckets 0..j: all nodes in buckets 0..j-1, plus only those nodes in the final bucket j that are closer to the current node than the key is.
The semester thesis "Implementation of the Kademlia Distributed Hash Table" by Bruno Spori (section "Calculation of the Number of Intermediate Nodes") calculates the distance between the current node and the given node, and counts nodes only in the buckets whose distance is less than or equal to the distance between the current node and the given node.
Both those methods seem right to me, but they are completely different and yield different results. The first method counts nodes between the current node and the given node based on the distance between the current node and other nodes in our buckets. The second method counts nodes between the current node and the given node based on the distance between the given node and other nodes in our buckets.
For example, if the current node has ID 0001_0100 (let's assume 8-bit ids for the sake of the example), there are only 8 buckets that contain nodes with the following prefixes:
0001_0101, 0001_011x, 0001_00xx, 0001_1xxx, 0000_xxxx, 001x_xxxx, 01xx_xxxx, 1xxx_xxxx.
Let's say we want to calculate expiration time for a key 1010_0010.
The distance between the current node and the given node is 182 (0001_0100 xor 1010_0010 = 182).
Using the first method, we would count all nodes in buckets 0..6 plus the nodes in the bucket 7 that are closer to the current node than the given ID.
This works because the distances between the current node and all the buckets are: 1, 2, 4, 12, 20, 52, 84, 148.
You can find them by xoring our ID with the range the bucket covers (I replaced x with 0 to get the smallest, but not necessarily the closest, ID that would fall into that bucket), e.g. 0001_0100 xor 0001_0101 = 1 and 0001_0100 xor 1000_0000 = 148.
All buckets up to the last one contain only nodes that are <= 182 (the distance between the current node and the given ID) away from the current node. The last bucket can have nodes that are further away.
So we count the nodes in all 8 buckets (counting the last one only partially).
Using the second method, we count all nodes in buckets 1, 2, 4, 5 and 7. We do not count buckets 0, 3, 6.
This works because the distances between the given ID and the buckets are: 183, 180, 178, 186, 162, 130, 226, 34.
You can find them by xoring the given ID with the range the bucket covers (I replaced x with 0 to get the smallest, but not necessarily the closest, ID that would fall into that bucket), e.g. 1010_0010 xor 0001_0101 = 183 and 1010_0010 xor 1000_0000 = 34.
Only buckets 1, 2, 4, 5 and 7 have nodes with a distance less than 182 (the distance between the current node and the given ID) with respect to the given ID.
So we count nodes only in 5 buckets out of 8.
Counting nodes in 8/8 buckets and 5/8 buckets is a big difference. Which method is the right one? Both seem to count the number of nodes between the current node and the given key, but the result is so different.
Note that the xor metric holds here, so there doesn't seem to be any mistake. For example, the distance between the current node and a node residing in the last bucket is 0001_0100 xor 1000_0000 = 148. The distance between the given node and the same node in the last bucket is 1010_0010 xor 1000_0000 = 34. 148 xor 34 = 182, so d(a, b) = d(a, c) xor d(c, b) holds.
The first method seems to count all nodes the current node knows of that are closer than 182 from the current node. The second method seems to count all nodes the current node knows of that are closer than 182 from the given node.
I'm thinking that the second method is more correct, as we want to find if we are a k-closest node to the given key. When finding nodes close to a given ID, i.e. the FIND_NODE RPC, you do so using a process similar to the second method to identify which buckets contain nodes closest to the given ID, e.g. in the given example those would be buckets 7, 5, 4, 2, 1, 0, 3, 6 - in that exact order, with the closest first.
But then again, the first method also makes sense, as we know our own neighbourhood best. We know all 8 buckets of nodes closer than 182 to the current node, while we know only 5 buckets of nodes closer than 182 to the given key.
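The arithmetic in the 8-bit example above can be reproduced with a few lines of Python. This is only a check of the numbers; the variable names are made up for illustration, and each bucket is represented by the smallest ID in its range (every x replaced with 0), as in the question.

```python
SELF_ID = 0b0001_0100
KEY = 0b1010_0010

# Smallest ID covered by each bucket, for buckets 0..7.
BUCKET_MIN_IDS = [
    0b0001_0101, 0b0001_0110, 0b0001_0000, 0b0001_1000,
    0b0000_0000, 0b0010_0000, 0b0100_0000, 0b1000_0000,
]

target = SELF_ID ^ KEY                          # distance self <-> key
from_self = [SELF_ID ^ b for b in BUCKET_MIN_IDS]
from_key = [KEY ^ b for b in BUCKET_MIN_IDS]

# Second method: buckets whose distance to the key is below the target.
closer_to_key = [i for i, d in enumerate(from_key) if d < target]

# Bucket order a lookup for the key would prefer, closest first.
lookup_order = sorted(range(len(BUCKET_MIN_IDS)), key=lambda i: from_key[i])

print(target)          # 182
print(from_self)       # [1, 2, 4, 12, 20, 52, 84, 148]
print(from_key)        # [183, 180, 178, 186, 162, 130, 226, 34]
print(closer_to_key)   # [1, 2, 4, 5, 7]
print(lookup_order)    # [7, 5, 4, 2, 1, 0, 3, 6]
```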
I lack intuition for the flat routing table layout, having long switched to the tree-based layout in my own implementation. So I will argue based on the latter instead.
The simplest approach for decaying the storage time in the tree-based layout is to see which bucket a storage key would fall into. If it falls into the deepest bucket (depth d), it gets the full time (T). For d-1 it gets T/2, for d-2 it gets T/4, and so on.
This can be inaccurate if non-local splitting is implemented, in which case one should consider the shallowest bucket of the k-closest set as the maximum depth.
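The halving rule above can be written as a one-line formula. The sketch below assumes a hypothetical helper `storage_time`, with depths counted from the root of the routing tree; the 24-hour value in the usage line is taken from the question, not from this answer.

```python
def storage_time(full_time, max_depth, bucket_depth):
    """Halve the storage time for every level the key's bucket sits above max_depth.

    max_depth is the depth of the deepest bucket (or, with non-local
    splitting, the shallowest bucket of the k-closest set).
    """
    if bucket_depth > max_depth:
        raise ValueError("bucket_depth cannot exceed max_depth")
    return full_time / 2 ** (max_depth - bucket_depth)


# With T = 24 hours and the deepest bucket at depth 5:
print([storage_time(24.0, 5, depth) for depth in (5, 4, 3)])  # [24.0, 12.0, 6.0]
```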
An alternative approach, which should also work with the flat layout, is to first estimate the global node density in the keyspace and then use the rule of three to get the node count for any distance. The estimates can be obtained in various ways. Using the distances in the k-closest set from the routing table for the local node ID is the simplest but also the noisiest one (and should be equivalent to the correction for non-local splitting above).
To check the algorithms I would use numerical simulation since even millions of IDs and distances can be calculated in a few seconds. Construct populations of N nodes (for varying N), represented by their IDs, then build a bunch of perfect routing tables for random node IDs within each population and then run your estimator for a bunch of intervals and calculate the error relative to the actual count from the simulated population.
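A rough sketch of such a simulation follows. Everything in it is an assumption made for illustration: 32-bit IDs instead of 160-bit ones, a bucket size of 8, a "perfect" table that keeps the up-to-k nodes closest to the local ID in each bucket, and node-level versions of the two counting methods from the question. The ground truth is taken to be the number of nodes in the population that are closer to the key than the local node is.

```python
import random

ID_BITS = 32  # assumption: short IDs keep the demo fast
K = 8         # assumption: bucket size


def perfect_table(self_id, population, k=K):
    """Group all other nodes by bucket index and keep up to k per bucket."""
    buckets = {}
    for node in population:
        if node != self_id:
            index = (node ^ self_id).bit_length() - 1
            buckets.setdefault(index, []).append(node)
    table = []
    for nodes in buckets.values():
        table.extend(sorted(nodes, key=lambda n: n ^ self_id)[:k])
    return table


def count_by_distance_to_self(self_id, key, table):
    """First method: known nodes closer to us than the key is."""
    limit = self_id ^ key
    return sum(1 for n in table if (n ^ self_id) < limit)


def count_by_distance_to_key(self_id, key, table):
    """Second method: known nodes closer to the key than we are."""
    limit = self_id ^ key
    return sum(1 for n in table if (n ^ key) < limit)


def actual_count(self_id, key, population):
    """Ground truth: all nodes in the population closer to the key than we are."""
    limit = self_id ^ key
    return sum(1 for n in population if n != self_id and (n ^ key) < limit)


def simulate(n_nodes, trials, seed=0):
    """Return (method 1, method 2, actual) count triples for random node/key pairs."""
    rng = random.Random(seed)
    population = rng.sample(range(2 ** ID_BITS), n_nodes)
    results = []
    for _ in range(trials):
        self_id = rng.choice(population)
        key = rng.randrange(2 ** ID_BITS)
        table = perfect_table(self_id, population)
        results.append((
            count_by_distance_to_self(self_id, key, table),
            count_by_distance_to_key(self_id, key, table),
            actual_count(self_id, key, population),
        ))
    return results


for row in simulate(n_nodes=2000, trials=5):
    print(row)
```

Both raw counts are bounded by the size of the routing table, so for keys far from the local node they cannot track the actual count without some scaling; comparing the columns for varying `n_nodes` shows how large that gap is for each method.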

Did I correctly perform extract max operation on this max-heap?

I am trying to understand how heaps work.
I have the following heap:
Now I want to extract the max value.
The first thing I do is delete the root 42, then put the last element in the heap (6) at the root position. I then perform max-heapify to find the correct spot for 6.
6 is smaller than its larger child, 41, so I swap the two, making 41 the new root.
6 now has the children 3 and 9, so I again swap it with the larger child, 9.
In the end I end up with the heap
Did I correctly perform extract-max?
Yes!
The sift-down step of extract-max works recursively: find the largest element among the three, i.e. the parent and its two children. If the largest is not the parent, swap it with the parent, then repeat the same step from the position the old parent moved to.
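A minimal sketch of extract-max on an array-based max-heap, written as a loop rather than recursion. The array below is hypothetical: the original image is not available, so it was only chosen to match the values mentioned in the question (root 42, last element 6, 41 as the larger child of the root, and 3 and 9 as the children of 41).

```python
def sift_down(heap, i):
    """Move heap[i] down until neither child is larger than it."""
    n = len(heap)
    while True:
        largest = i
        for child in (2 * i + 1, 2 * i + 2):
            if child < n and heap[child] > heap[largest]:
                largest = child
        if largest == i:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest


def extract_max(heap):
    """Remove and return the root, then restore the heap property."""
    top = heap[0]
    last = heap.pop()          # last element in level order
    if heap:
        heap[0] = last         # move it to the root position
        sift_down(heap, 0)
    return top


heap = [42, 41, 20, 3, 9, 15, 6]
print(extract_max(heap))  # 42
print(heap)               # [41, 9, 20, 3, 6, 15]
```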

Data structure/ Retrieving elements parent

I'm looking for a way to find the common elements of two parent elements.
For example, the parents here are 1 and 2 (ignore the values below them).
And the common value for those parents is 91.
Parent: a value that is on top and has NO parent.
Next example:
Here we have 3 parents and quite a lot of common elements for them:
91,
92,
93,
911,
912,
931,
932,
9311,
9312.
The main problem is to get the common elements. Maybe there are also suggestions on how I could store them?
Run a BFS/DFS (doesn't really matter which one) from the first node and store a visited bit for every node (say in a vector/array of bool).
Now run the same algorithm again from the second node. Every time you reach a new node, check whether it was visited by the first run as well. If it was, the node is one of the common elements, so output it wherever you want.
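A minimal sketch of this idea, using a set of visited nodes per start node instead of a bit array and intersecting the two sets at the end. The graph below is hypothetical, since the original images are not available; `reachable` and `common_elements` are illustrative names.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start (excluding start) via BFS."""
    seen = set()
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def common_elements(graph, first, second):
    """Nodes that can be reached from both parents."""
    return reachable(graph, first) & reachable(graph, second)


# Hypothetical graph: node -> list of children.
graph = {1: [11, 91], 2: [91, 21], 91: [911, 912]}
print(sorted(common_elements(graph, 1, 2)))  # [91, 911, 912]
```

For more than two parents, the same intersection can be repeated over the reachable set of each additional parent.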

B-Trees insertion

How do I add 35?
How do I know whether to move a key up (up to the node with 34 and 78, and if I do that, which key do I move up?) and make more children (to fulfill the "A non-leaf node with k children contains k−1 keys." rule)
OR
just split up the 39, 44, 56, 74 (and 35) node into three children, like what I did in step 8?
AFAIK the insertion procedure is the following:
Find a position (node). Insert. Split if necessary. In this case 39,44,56,74 becomes 35,39,44,56,74. You need to split now: the median key 44 moves up, the new nodes are 35,39 and 56,74, and the parent is now 34,44,78.
Look at this example.
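As a minimal sketch of the insert-then-split step described above, nodes are modelled here as plain sorted lists of keys. The helper name `insert_and_split` and the limit of 4 keys per node are assumptions made for illustration (the limit is inferred from the node 39,44,56,74 being full).

```python
import bisect


def insert_and_split(leaf, parent, key, max_keys=4):
    """Insert key into a leaf; if it overflows, push the median up to the parent.

    Returns (list of resulting leaves, resulting parent keys).
    The inputs are not modified.
    """
    leaf = list(leaf)
    bisect.insort(leaf, key)
    if len(leaf) <= max_keys:
        return [leaf], list(parent)
    mid = len(leaf) // 2
    new_parent = list(parent)
    bisect.insort(new_parent, leaf[mid])     # median key moves up
    return [leaf[:mid], leaf[mid + 1:]], new_parent


leaves, parent = insert_and_split([39, 44, 56, 74], [34, 78], 35)
print(leaves)  # [[35, 39], [56, 74]]
print(parent)  # [34, 44, 78]
```

If the parent itself overflows after receiving the median, the same split is applied to the parent, one level up.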

Why is this binary tree not a heap?

I have been teaching myself heaps for an exam and came across the following question:
"Give 2 different reasons why the following binary tree is not a heap"
         91
        /  \
      77    46
     /  \     \
   68    81    11
I know one of the reasons: in a max-heap, each child must be less than or equal to its parent, and 81 violates this rule since 81 > 77. I am not sure about the other reason, though.
Could somebody please clarify?
11 should be the left-child of 46, not the right-child.
Wikipedia mentions that a binary heap should be a complete binary tree, which means "every level, except possibly the last, is completely filled, and all nodes are as far left as possible", which is clearly not the case if 11 is where it is now.
The reason why this is advantageous is fairly easy to understand - given the size of the heap, you can quickly determine where the last node on the bottom level is, which is necessary to know for insertion and deletion. If we're using an array representation, it's as simple as the element at heap size - 1 being the last element. For a pointer-based representation, we could easily determine whether we should go left or right to get to the last element.
There may be other ways to get the same performance without the heap being a complete binary tree, but they'd likely add complexity.
That is not a heap because it doesn't conform to the heap property.
In a min-heap, every node's value is less than or equal to its child nodes' values.
In a max-heap, every node's value is greater than or equal to its child nodes' values.
It's clearly not a min-heap because the root node, 91, is larger than either of its children.
And it's clearly not a max-heap because the node 77 is smaller than its right child, 81.
And, as @Dukeling pointed out in his answer, it doesn't conform to the shape property.
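Both checks can be sketched in a few lines, assuming the tree is stored in level order in a list with `None` marking a missing node (the helper name `check_max_heap` is made up for illustration):

```python
def check_max_heap(tree):
    """Return (shape_ok, order_violations) for a level-order list.

    shape_ok is False if a gap (None) appears before the last real node.
    order_violations lists (parent, child) pairs where child > parent.
    """
    last = max(i for i, value in enumerate(tree) if value is not None)
    shape_ok = all(value is not None for value in tree[:last + 1])
    order_violations = []
    for i, value in enumerate(tree):
        if i == 0 or value is None:
            continue
        parent = tree[(i - 1) // 2]
        if parent is not None and value > parent:
            order_violations.append((parent, value))
    return shape_ok, order_violations


# The tree from the question: 46 has no left child, only the right child 11.
print(check_max_heap([91, 77, 46, 68, 81, None, 11]))  # (False, [(77, 81)])
```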
