Is there a possible way to find the original insertion order of a B+ tree?
I have this tree:
{ [ (1 2) 3 (5 6 7) ] 8 [ (9 10) 11 (12 13) 14 (14 16 17) 18 ( 19 20) ] }
Example tree
No.
For example, in your case the last two inserts could have been 7 then 17, or 17 then 7, and there is absolutely no way to tell which. Likewise, of 5, 6, 7, one of the three was inserted after the other two, and there is no record of which one.
This can also be seen immediately from the pigeonhole principle. First let's put an upper bound on how many k-way B-trees there can be with n elements in them.
The structure of any B-tree with n elements can be encoded as a stream of node sizes. From the size of the root node, you know how many second-tier nodes there are, and likewise for the third tier from the second. A node can hold anywhere from 1 to k elements, and there cannot be more nodes than elements. Therefore we can specify a B-tree by first giving the number of nodes s, then the sizes of the nodes. (Not every sequence of numbers yields a valid B-tree.) For every node count s there are at most k^s <= k^n such encodings, so n * k^n is an upper bound on how many k-way B-trees there can be. That is exponential growth.
But the number of orders in which n elements might be inserted is n!, which grows strictly faster than any exponential, so you cannot recover the insertion order from the B-tree.
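The counting argument is easy to check numerically; here is a quick sketch, with k = 4 chosen arbitrarily:

```python
from math import factorial

# Numeric check of the pigeonhole argument, with k = 4 chosen arbitrarily:
# once n! exceeds the n * k**n bound on distinct k-way B-trees, at least two
# different insertion orders must yield the exact same tree.
k = 4
for n in (5, 10, 20, 40):
    orders = factorial(n)          # possible insertion orders
    tree_bound = n * k ** n        # upper bound on distinct B-trees
    print(n, orders > tree_bound)  # False, False, True, True
```

Already at n = 20 there are more insertion orders than trees, so collisions are unavoidable.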
Assume that a max heap with 10^6 elements is stored in a complete 7-ary tree. Approximately how many comparisons will a call to removeMin() make?
5000
50
10^6
500
5
My solution: the number of comparisons should be at most the number of leaf nodes, because in a max heap the minimum can be at any of the leaf nodes, but that count is not among the options. A better approach might be to take the square of log_7(10^6), which gives about 50, but that only works if the minimum is guaranteed to lie along a single branch of the tree, which is not the case in a max heap.
I hope that you can help.
There's no "natural" way to remove the minimum value from a max heap. You simply have to look at all the leaf nodes to figure out which one happens to be the minimum.
The question then is how many leaf nodes there are. Intuitively, we'd expect the fraction of nodes in the heap that are leaves to be pretty close to one. Take it to the limit: if you have a 1,000,000-ary heap, you'd have one node in the top layer and all remaining 999,999 elements in the next layer. Even in the smallest case, where the heap is a binary heap, you'd expect roughly half the elements to be in the bottom layer.
More specifically, let's do some math! How many leaves will a 7-ary heap with n nodes have? Well, each node in the tree will either
be a leaf, or
have seven children,
with one possible exception: since the bottommost row might not be full, there might be one node with fewer than seven children. Since that's just a one-off, we can ignore it when we're dealing with millions of elements. A quick proof by induction can be used to show that any tree where each node has either no children or seven children has roughly six times as many leaf nodes as internal nodes (prove this!), so we'd expect about (6/7)ths of the nodes to be leaves, for a total of roughly 857,000 leaves to check.
As a result, the best answer here would be roughly 10^6 comparisons.
The min element can be any of the leaves of a max heap of any arity; there is no ordering among them. All elements from A[10^6/7 + 1] onwards (where A is the array storing the heap) are leaf nodes and need to be checked. This means roughly 857,142 comparisons just to find the minimum. After that there is no simple way to 'remove' the minimum without introducing a gap that cannot be filled by simply shifting the leaves.
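The leaf count is easy to verify, assuming the usual 1-indexed array layout for a d-ary heap, where the children of node i live at indices d*(i-1)+2 through d*i+1:

```python
# Count leaves in a complete 7-ary heap with 10^6 nodes. Node i is internal
# iff its first child index, d*(i-1)+2, is still inside the array.
n, d = 10**6, 7
internal = (n - 2) // d + 1     # largest i with d*(i-1)+2 <= n
leaves = n - internal
print(internal, leaves)         # 142857 857143
```

So about 6/7 of the million nodes are leaves, all of which must be examined.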
This is a misprint. Maybe the teacher wanted to ask removeMax, for which the answer is close to 50 -- see below:
Heapify does 7 comparisons per level, since each node has 7 children. If h is the height of the heap, that's 7*h comparisons.
Rough analysis: (here ~ means approximately)
h ~ log_7(10^6) = 7.1, hence total comparisons 7*7.1 ~ 50
More accurate analysis:
A full 7-ary heap of height h has 1 + 7 + 7^2 + ... + 7^h elements.
The left side is a geometric series that sums to (7^(h+1) - 1)/6.
A heap with 10^6 elements therefore needs the smallest h with (7^(h+1) - 1)/6 >= 10^6,
i.e. h + 1 >= log_7(6*10^6 + 1) ≈ 8.02, giving h = 8 (the last level is only partially full); hence 7*8 = 56, and from the options 50 is the closest.
*A is the array storing the heap.
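The height estimate from the accurate analysis can be double-checked in a few lines, assuming a complete 7-ary heap with the root alone at level 0:

```python
# Find the smallest height h such that a 7-ary tree filled through level h,
# i.e. (7**(h+1) - 1) // 6 nodes, can hold 10**6 elements.
n = 10**6
h = 0
while (7 ** (h + 1) - 1) // 6 < n:
    h += 1
print(h, 7 * h)  # 8 56
```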
Suppose we are given a set of n numbers initially. We need to construct a data structure that supports Search(), Predecessor() and Deletion() queries.
Search and Predecessor should take worst-case O(log n) time. Deletion should take amortized O(log n) time. Preprocessing time allowed is O(n), and we initially have O(n) space. But I want that at any stage, if there are m numbers left in the structure, the space used is O(m).
This problem can be solved trivially using RBT, but I want a data structure that makes use of array(s) only. I don't want an implementation of RBT using arrays, the data structure shouldn't use any algorithms inspired from trees. Can anyone think of one such data structure?
I can suggest a tree-like structure that is simpler than an RBT. To simplify the description, let the number of elements be 2^k.
The first level is just the numbers.
The second level has 2^(k-1) numbers, as in a binary tree.
The next level has 2^(k-2) numbers, and so on, until we are left with a single number.
So we have a binary tree with 2^(k+1) - 1 nodes, where each parent contains the maximum of its children's values. Building this tree takes O(n) time. The tree consumes O(n) space: the first level takes n, the second n/2, the third n/4, and so on, so the total is n + n/2 + n/4 + ... = 2n = O(n). E.g. we get the following tree for the numbers 1, 2, 4, 7, 8, 9, 12, 14:
1 2 4 7 8 9 12 14
2 7 9 14
7 14
14
To delete an element, find it by binary search and mark it NULL in the list, then update the tree: set each ancestor to the maximum of its non-NULL children, or to NULL if both children are NULL. This takes k operations (the tree height), i.e. O(log n).
E.g. we delete 12.
1 2 4 7 8 9 NULL(12) 14
2 7 9 14
7 14
14
Now delete 7.
1 2 4 NULL(7) 8 9 NULL(12) 14
2 4 9 14
4 14
14
Now we can do search and predecessor in O(log n). By binary search, find our element or its predecessor in the first level. If it is not NULL, we have the answer. If it is NULL, go up a level; there are three cases here:
The node's value is NULL: go up another level.
The node's left child has a non-NULL value: it is smaller than the requested number (due to the tree structure), and it is the answer.
The node's value is bigger than the requested number: go up another level.
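A runnable sketch of this structure (all names are my own), with the sorted input padded up to a power of two and the tree stored flat in an array; internal node i holds the maximum non-deleted value in its subtree, or None if everything below is deleted:

```python
import bisect

class TournamentTree:
    def __init__(self, sorted_nums):
        size = 1
        while size < len(sorted_nums):
            size *= 2
        self.keys = list(sorted_nums)        # for the O(log n) binary search
        self.size = size
        self.tree = [None] * (2 * size)
        self.tree[size:size + len(sorted_nums)] = sorted_nums
        for i in range(size - 1, 0, -1):     # O(n) bottom-up build
            self.tree[i] = self._merge(self.tree[2 * i], self.tree[2 * i + 1])

    @staticmethod
    def _merge(a, b):
        if a is None:
            return b
        if b is None:
            return a
        return max(a, b)

    def delete(self, x):                     # assumes x is present; O(log n)
        i = self.size + bisect.bisect_left(self.keys, x)
        self.tree[i] = None
        i //= 2
        while i:                             # refresh ancestors up to the root
            self.tree[i] = self._merge(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def predecessor(self, x):                # largest live value <= x; O(log n)
        lo = self.size
        hi = self.size + bisect.bisect_right(self.keys, x)
        best = None
        while lo < hi:                       # max over live leaves in [lo, hi)
            if lo & 1:
                best = self._merge(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = self._merge(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best

t = TournamentTree([1, 2, 4, 7, 8, 9, 12, 14])
t.delete(12)
t.delete(7)
print(t.predecessor(13))  # 9 (both 12 and 7 are gone)
```

Search is the same binary search plus a check that the leaf is not None, so all three operations stay within O(log n).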
Here is the interview problem: design a data structure over the range of integers {1,...,M} (numbers can repeat) supporting insert(x), delete(x), and mode(), which returns the most frequent number.
The interviewer said that all the operations can be done in O(1) with O(M) preprocessing. He also accepted a solution doing insert(x) and delete(x) in O(log n) and mode in O(1) with O(M) preprocessing.
But I can only do insert(x) and delete(x) in O(n) with mode in O(1). How can I achieve O(log n), or even O(1), for insert(x) and delete(x), with mode in O(1) and O(M) preprocessing?
When you hear O(log X) operations, the first structures that should come to mind are a binary search tree and a heap. For reference (since I'm focusing on a heap below):
A heap is a specialized tree-based data structure that satisfies the heap property: If A is a parent node of B then the key of node A is ordered with respect to the key of node B with the same ordering applying across the heap. ... The keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node (this kind of heap is called max heap) ....
A binary search tree doesn't allow construction (from unsorted data) in O(M), so let's see if we can make a heap work (you can create a heap in O(M)).
Clearly we want the most frequent number at the top, so this heap needs to use frequency as its ordering.
But this brings us to a problem - insert(x) and delete(x) will both require that we look through the entire heap to find the correct element.
Now you should be thinking "what if we had some sort of mapping from index to position in the tree?", and this is exactly what we're going to have. If all / most of the M elements exist, we could simply have an array, with each index i's element being a pointer to the node in the heap. If implemented correctly, this will allow us to look up the heap node in O(1), which we could then modify appropriately, and move, taking O(log M) for both insert and delete.
If only a few of the M elements exist, replacing the array with a (hash-)map (of integer to heap node) might be a good idea.
Returning the mode will take O(1).
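A sketch of this heap-plus-index combination, using a plain binary max-heap keyed on frequency and a dict in place of the index array (all names here are my own; a 7-ary heap would work the same way):

```python
class FreqHeap:
    def __init__(self, elements):
        self.heap = [[0, x] for x in elements]   # [frequency, element]
        self.pos = {x: i for i, x in enumerate(elements)}  # element -> slot

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i and self.heap[i][0] > self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.heap[c][0] > self.heap[largest][0]:
                    largest = c
            if largest == i:
                return
            self._swap(i, largest)
            i = largest

    def insert(self, x):          # O(1) lookup via pos, O(log M) sift
        i = self.pos[x]
        self.heap[i][0] += 1
        self._sift_up(i)

    def delete(self, x):          # O(1) lookup via pos, O(log M) sift
        i = self.pos[x]
        self.heap[i][0] -= 1
        self._sift_down(i)

    def mode(self):               # O(1): the root has the highest frequency
        return self.heap[0][1]

h = FreqHeap(range(1, 8))         # M = 7, all frequencies start at 0
for x in [3, 5, 3, 3, 5]:
    h.insert(x)
print(h.mode())  # 3
```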
O(1) for all operations is certainly quite a bit more difficult.
The following structure comes to mind:
 3         2
 ^         ^
 |         |
 5    7    4    1
12   14   15   18
To explain what's going on here - 12, 14, 15 and 18 correspond to the frequency, and the numbers above correspond to the elements with said frequency, so both 5 and 3 would have a frequency of 12, 7 and 2 would have a frequency of 14, etc.
This could be implemented as a double linked-list:
/-------\ /-------\
(12) <-> 5 <-> 3 <-> (13) <-> (14) <-> 7 <-> 2 <-> (15) <-> 4 <-> (16) <-> (18) <-> 1
^------------------/ ^------/ ^------------------/ ^------------/ ^------/
You may notice that:
I filled in the missing 13 and 16 - these are necessary, otherwise we'll have to update all elements with the same frequency when doing an insert (in this example, you would've needed to update 5 to point to 13 when doing insert(3), because 13 wouldn't have existed yet, so it would've been pointing to 14).
I skipped 17 - this is just an optimization in terms of space usage - it makes this structure take O(M) space, as opposed to O(M + MaxFrequency). The exact condition for skipping a number is simply that no element has that frequency or one less than that frequency.
There are some strange things going on above the linked list: these simply mean that 5 also points to 13, and 7 also points to 15, i.e. each element keeps a pointer to the next frequency as well.
There are some strange things going on below the linked list: these simply mean that each frequency keeps a pointer to the frequency before it (this is more space-efficient than each element keeping pointers to both its own and the next frequency).
Similarly to the above solution, we'd keep a mapping (array or map) of integer to node in this structure.
To do an insert:
Look up the node via the mapping.
Remove the node.
Get the pointer to the next frequency, insert it after that node.
Set the next frequency pointer using the element after the insert position (either it is the next frequency, in which case we can just make the pointer point to that, otherwise we can make this next frequency pointer point to the same element as that element's next frequency pointer).
To do a remove:
Look up the node via the mapping.
Remove the node.
Get the pointer to the current frequency via the next frequency, insert it before that node.
Set the next frequency pointer to that node.
To get the mode:
Return the last node.
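A simplified runnable sketch of the same idea, using hash buckets keyed by frequency in place of the explicit doubly linked list (all names are my own). The invariant it relies on is the same one the list exploits: a single insert or delete changes an element's frequency by exactly 1, so the maximum frequency can only ever move one step at a time:

```python
from collections import defaultdict

class ModeTracker:
    def __init__(self):
        self.freq = defaultdict(int)        # element -> current frequency
        self.buckets = defaultdict(set)     # frequency -> elements at it
        self.max_freq = 0

    def insert(self, x):                    # O(1)
        f = self.freq[x]
        self.buckets[f].discard(x)
        self.freq[x] = f + 1
        self.buckets[f + 1].add(x)
        self.max_freq = max(self.max_freq, f + 1)

    def delete(self, x):                    # O(1); assumes x is present
        f = self.freq[x]
        self.buckets[f].discard(x)
        self.freq[x] = f - 1
        if f > 1:
            self.buckets[f - 1].add(x)
        if not self.buckets[self.max_freq]: # top bucket emptied: step down
            self.max_freq -= 1

    def mode(self):                         # O(1); assumes a non-empty structure
        return next(iter(self.buckets[self.max_freq]))

m = ModeTracker()
for x in [5, 3, 5, 2, 5]:
    m.insert(x)
print(m.mode())  # 5
```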
Since the range is fixed, for simplicity let's take the example M = 7 (the range is 1 to 7). So we need at most 3 bits to represent each number.
0 - 000
1 - 001
2 - 010
3 - 011
4 - 100
5 - 101
6 - 110
7 - 111
Now create a binary tree with each node having 2 children (like in the Huffman coding algorithm). Each leaf will contain the frequency of one number (initially 0 for all), and the addresses of these leaf nodes are saved in an array, with the number as the index (i.e. the address of the node for 1 is at index 1 in the array).
With this preprocessing, we can execute insert and delete in O(1) and mode in O(M) time.
insert(x) - go to location x in the array, follow the address to the node, and increment its counter.
delete(x) - as above, just decrement the counter if it is > 0.
mode - linear search in the array for the maximum frequency (counter value).
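A minimal sketch of this answer's counters, with M = 7 as above (the sketch keeps only the counter array, which is what insert, delete, and mode actually touch):

```python
M = 7
count = [0] * (M + 1)           # count[x] is the frequency of x, 1 <= x <= M

def insert(x):                  # O(1)
    count[x] += 1

def delete(x):                  # O(1)
    if count[x] > 0:
        count[x] -= 1

def mode():                     # O(M) linear scan for the max counter
    return max(range(1, M + 1), key=lambda x: count[x])

for x in [3, 5, 3, 7]:
    insert(x)
print(mode())  # 3
```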
I was looking at the best & worst case scenarios for a B+Tree (http://en.wikipedia.org/wiki/B-tree#Best_case_and_worst_case_heights) but I don't know how to use this formula with the information I have.
Let's say I have a tree B with 1,000 records. What is the minimum (and maximum) number of levels B can have?
I can have as many or as few keys on each page as I like, and likewise as many or as few pages.
Any ideas?
(In case you are wondering, this is not a homework question, but it will surely help me understand some stuff for hw.)
I don't have the math handy, but...
Basically, the primary factor to tree depth is the "fan out" of each node in the tree.
Normally, in a simple binary tree, the fan out is 2: two children for each node in the tree.
But with a B+Tree, typically they have a fan out much larger.
One factor that comes in to play is the size of the node on disk.
For example, suppose you have a 4K page size with, say, 4000 bytes of free space (not including any other pointers or metadata related to the node), and let's say that a pointer to any other node in the tree is a 4-byte integer. If your B+Tree is in fact storing 4-byte integers, then each entry (4 bytes of pointer information + 4 bytes of key information) is 8 bytes, and 4000 free bytes / 8 bytes = 500 possible children.
That give you a fan out of 500 for this contrived case.
So, with one page of index, i.e. the root node, or a height of 1 for the tree, you can reference 500 records. Add another level, and you're at 500*500, so for 501 4K pages, you can reference 250,000 rows.
Obviously, the larger the key size, or the smaller the page size of your node, the lower the fan out the tree is capable of. If you allow variable-length keys in each node, then the fan out can easily vary.
But hopefully you can see the gist of how this all works.
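The arithmetic in that contrived example, as a quick sketch:

```python
# Fan-out estimate for one node: a 4K page with ~4000 usable bytes, 4-byte
# keys and 4-byte child pointers (the assumptions from the example above).
usable_bytes = 4000
entry_bytes = 4 + 4                    # key + child pointer
fan_out = usable_bytes // entry_bytes
print(fan_out)                         # 500 children per node
print(fan_out ** 2)                    # 250000 rows reachable with two levels
```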
It depends on the arity of the tree, and you have to define that value. If each node can have 4 children and you have 1000 records, then the height is
Best case log_4(1000) = 5
Worst case log_{4/2}(1000) = 10
The arity is m and the number of records is n.
The best and worst case depend on the number of children each node can have. For the best case, we consider the case when each node has the maximum number of children (i.e. m for an m-ary tree), with each node holding m-1 keys. So,
1st level (the root) has m-1 entries
2nd level has m*(m-1) entries (since the root has m children with m-1 keys each)
3rd level has m^2*(m-1) entries
....
Hth level has m^(H-1)*(m-1) entries
Thus, if H is the height of the tree, the total number of entries is n = m^H - 1,
which is equivalent to H = log_m(n+1)
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the best case height will be equal to log_m(1000+1)
Similarly, for the worst case scenario:
Level 1(root) has at least 1 entry (and minimum 2 children)
2nd level has at least 2*(d-1) entries (where d = ceil(m/2) is the minimum number of children each internal node, except the root, can have)
3rd level has 2d*(d-1) entries
...
Hth level has 2*d^(H-2)*(d-1) entries
Thus, if H is the height of the tree, the total number of entries is n = 2*d^(H-1) - 1 (summing the geometric series), which is equivalent to H = log_d((n+1)/2) + 1
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the worst case height will be equal to log_d((1000+1)/2) + 1
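The two bounds can be wrapped up as a short sketch, using the formulas derived above with d = ceil(m/2) (the results are real-valued heights, so round up in practice):

```python
import math

def best_case_height(n, m):
    # From n = m**H - 1, i.e. every node completely full.
    return math.log(n + 1, m)

def worst_case_height(n, m):
    # From n = 2*d**(H-1) - 1 with d = ceil(m/2), i.e. minimally full nodes.
    d = math.ceil(m / 2)
    return math.log((n + 1) / 2, d) + 1

n, m = 1000, 4
print(round(best_case_height(n, m), 2))   # 4.98
print(round(worst_case_height(n, m), 2))  # 9.97
```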
If I have a sorted set of data which I want to store on disk in a way that is optimal both for reading sequentially and for doing random lookups, it seems that a B-Tree (or one of its variants) is a good choice ... presuming this data set does not all fit in RAM.
The question is can a full B-Tree be constructed from a sorted set of data without doing any page splits? So that the sorted data can be written to disk sequentially.
Constructing a "B+ tree" to those specifications is simple.
Choose your branching factor k.
Write the sorted data to a file. This is the leaf level.
To construct the next highest level, scan the current level and write out every kth item.
Stop when the current level has k items or fewer.
Example with k = 2:
0 1|2 3|4 5|6 7|8 9
0 2 |4 6 |8
0 4 |8
0 8
Now let's look for 5. Use binary search to find the last number less than or equal to 5 in the top level, or 0. Look at the interval in the next lowest level corresponding to 0:
0 4
Now 4:
4 6
Now 4 again:
4 5
Found it. In general, the jth item corresponds to items jk through (j+1)k-1 at the next level. You can also scan the leaf level linearly.
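A runnable sketch of this construction and lookup (the function names are my own):

```python
import bisect

def build_levels(sorted_data, k):
    levels = [list(sorted_data)]                  # levels[0] is the leaf level
    while len(levels[-1]) > k:
        levels.append(levels[-1][::k])            # keep every k-th item
    return levels

def contains(levels, key, k):
    lo, hi = 0, len(levels[-1])                   # window in the top level
    for depth in range(len(levels) - 1, -1, -1):  # top level down to leaves
        level = levels[depth]
        j = bisect.bisect_right(level, key, lo, hi) - 1
        if j < lo:
            return False                          # smaller than every item
        if depth == 0:
            return level[j] == key                # reached the leaf level
        lo = j * k                                # interval one level down
        hi = min((j + 1) * k, len(levels[depth - 1]))

levels = build_levels(range(10), k=2)
print(levels[1:])                  # [[0, 2, 4, 6, 8], [0, 4, 8], [0, 8]]
print(contains(levels, 5, k=2))    # True
print(contains(levels, 10, k=2))   # False
```

Since every level is written sequentially and no node is ever split, the leaf level can go straight to disk in sorted order.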
We can make a B-tree in one pass, but it may not be the optimal storage method. Depending on how often you make sequential queries vs random-access ones, it may be better to store the data in sorted order and use binary search to service a random-access query.
That said: assume that each node in your B-tree holds (m - 1) keys (m > 2; the binary case is a bit different). We want all the leaves on the same level and all the internal nodes to have at least (m - 1)/2 keys. We know that a full B-tree of height k has m^k - 1 keys. Let n be the total number of keys to store, and let k be the smallest integer such that m^k - 1 >= n. Now, if 2*m^(k-1) - 1 < n, we can completely fill the inner nodes and distribute the remaining keys evenly among the leaf nodes, each leaf getting either the floor or the ceiling of (n + 1 - m^(k-1))/m^(k-1) keys. If we cannot do that, we know we have enough to fill all of the nodes at depth k - 1 at least halfway and to store one key in each leaf.
Once we have decided the shape of our tree, we need only do an inorder traversal of the tree sequentially dropping keys into position as we go.
Optimal meaning that an inorder traversal of the data will always be seeking forward through the file (or mmapped region), and a random lookup is done in a minimal number of seeks.