Can an m-way B-Tree have m odd? - algorithm

I read in the book CLRS that we have m-way B-trees where m is even. But is there a B-tree where m is odd, and if there is, how can we change the code given in this book to support it?

By an m-way B-tree I assume you mean a B-tree where each internal node is allowed to have at most m children. According to CLRS's definition of a B-tree:
Nodes have lower and upper bounds on the number of keys they can contain. We express these bounds in terms of a fixed integer t ≥ 2 called the minimum degree of the B-tree: ... an internal node may have at most 2t children.
So the maximum number of children will always be even – by this definition it can not be odd.
However, this is not the only definition of a B-tree. There are many definitions with slight differences that ultimately make little difference to overall performance, which can cause confusion. Some B-tree definitions allow odd upper bounds and some do not; CLRS's definition does not allow an odd upper bound on the number of children of an internal node.
However, another formal definition of a B-tree is by Knuth [1998] (The Art of Computer Programming, Volume 3 (Second ed.), Addison-Wesley, ISBN 0-201-89685-0). Knuth's definition does allow odd upper bounds. While CLRS enumerates all min-max tree bounds of the form (t, 2t) for t ≥ 2, Knuth enumerates all tree bounds of the form (ceil(k/2), k) for k ≥ 3.
Knuth Order, k | (min,max) | CLRS Degree, t
---------------|-----------|---------------
             0 |     –     |       –
             1 |     –     |       –
             2 |     –     |       –
             3 |   (2,3)   |       –
             4 |   (2,4)   |     t = 2
             5 |   (3,5)   |       –
             6 |   (3,6)   |     t = 3
             7 |   (4,7)   |       –
             8 |   (4,8)   |     t = 4
             9 |   (5,9)   |       –
            10 |   (5,10)  |     t = 5
So for example, a 2-3 tree, (2,3), is a B-tree with Knuth order 3. But it is not a valid CLRS tree because it has an odd upper bound.
Changing the code will not be easy, as a lot of the B-tree code depends on the variable t. One of the biggest changes would be inside B-TREE-SPLIT-CHILD(x, i): you'd need to find a way to split a child with an odd number of children (an even number of keys) into nodes y and z, where one of the two resulting nodes gets one more key than the other. If you're looking for code, I'd recommend searching the Internet for an implementation of a B-tree that uses a definition similar to Knuth's (e.g. search for "Knuth order B-tree").
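A sketch of that asymmetric split (illustrative only, not CLRS's pseudocode; the names `keys` and `split` are made up for this example): with an even number of keys there is no single middle key, so one side necessarily ends up one key longer than the other.

```python
# Hedged sketch: splitting a full node that holds an even number of
# keys, as needed for an odd Knuth order k (max k children, so at most
# k - 1 keys per node).
def split(keys):
    """Split a full, sorted key list into (left, median, right).

    With an even number of keys there is no exact middle, so we
    arbitrarily promote the lower median; the right half ends up one
    key longer than the left half."""
    mid = (len(keys) - 1) // 2          # lower-median index
    return keys[:mid], keys[mid], keys[mid + 1:]
```

For a Knuth order-5 tree, a full node `[1, 2, 3, 4]` splits into `[1]`, promoted key `2`, and `[3, 4]` — one resulting node has one more key than the other, exactly the asymmetry described above.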


Optimal Search Tree : Calculate the cost of the search tree and show that it is not optimal

Consider the following binary search tree, along with the following frequencies of lookups:
        13
       /  \
     11    26
    /  \     \
   1    12    28

Key       | 13 | 11 | 26 | 1 | 12 | 28
----------|----|----|----|---|----|----
Frequency | 26 |  5 | 25 | 1 |  3 | 15
I was given this question:
Calculate the cost of the search tree above for the given search
frequencies and show that the search tree isn't optimal for the given
search frequencies.
I calculated the cost, but my teacher says I did so incorrectly, without explaining why.
What we need to do to calculate the cost is check which level each node is on. 13 is on level 1 and the frequency of 13 is 26, so we get 26*1 = 26.
The nodes 11 and 26 are in level 2, the nodes 1, 12 and 28 are in level 3.
In the end we have for the cost: 26*1 + 5*2 + 25*2 + 1*3 + 3*3 + 15*3 = 143.
Also, how do you show that a tree isn't optimal? Here is the definition I have from our script:
Let K be a set of keys and R a workload. A search tree T over K is optimal for R iff P(T) = min{P(T') | T' is a search tree for K}, where P denotes the cost of a tree under R.
@templatetypedef Thank you very much for taking the time to help! Your answer is very nice and I understood many things from it. Here is a tree I found that is better than the tree from the task:
        26
       /  \
     13    28
     /
   11
   /  \
  1    12
The tree above has a cost of 143 and this one has a cost of 138. So this one really is better, and the task is solved :)
Fundamentally, you're approaching the question of calculating the total lookup time in a BST correctly. You're taking each node in the tree, using the depth to determine the number of comparisons necessary to perform a lookup that ends at that node, multiplying those values by the number of lookups, and summing the results. I didn't meticulously double-check your exact calculations and so it's possible that you missed something, though.
Your second question was about determining whether a binary search tree is optimal for a given set of lookups. You've given the rigorous mathematical definition, but in this case I think it might be a bit easier to explain this at a higher level.
The calculation you did earlier here is a way of starting with a BST and information about what lookups will be performed, then computing a number corresponding to the number of comparisons that will end up being made in the course of performing those lookups. That number essentially tells you how fast those lookups are going to be - higher numbers mean that the lookups take longer, and lower numbers mean that the lookups will take less time.
Now, imagine that you want to make a BST that will take the least total amount of time to perform the lookups in question. In other words, you want the "best" BST for the given set of keys and lookup frequencies. That BST would be one with the lowest total lookup cost, where that cost is calculated using the approach you worked through earlier on. The terminology for a BST with that property - that it has the best lookup speed for those frequencies among all the possible BSTs that you can make - is an optimal BST.
The question here is to show that the tree you have isn't optimal. That means that you need to show that this isn't the best possible tree you can make. One way to do this would be to find an even better tree. So can you find another BST with the same keys where the total lookup time is lower than the one you were given?
Good luck!

Deriving recurrence relations from a paragraph

I've recently gotten back into learning about discrete math. I enrolled in a course at university and am having trouble getting the hang of things again, especially when it comes to deriving a recurrence relation from a word problem. I would love to have some tips on how to do so.
For example (I've changed numbers from the homework question so if this doesn't work out just let me know): if Jean divides an input of size n into three subsets each of size n/5 and combines them in theta(n) time, what is the runtime? I got 3T(n/5) + theta(n) as the recurrence relation and I have no idea what the runtime is, and I feel like those are both incorrect.
I found this site (https://users.cs.duke.edu/~ola/ap/recurrence.html) to be helpful for breaking down a recurrence relation into a solid runtime, but I still don't get how to get the recurrence relation from the word problem in the first place. Thanks!
Think of such problems as a tree structure with nodes on each level. Each node contains a number: the size of the problem you are dealing with at that point. Each level has some number of nodes, anywhere from 1 up to a maximum of n nodes per level.
Now start from the top level. You will have 1 node, and the value in it will be 'n' (because at the starting point we have 'n' elements to deal with).
Now come down to the second level. The question above says to divide the problem (the elements) into three parts, so the number of nodes on level 2 will be 3. The value in each node will be 'n/5' (because the question says the size of each subset is the number of elements in the parent node divided by 5). The tree will look like:
        (n)
      /  |  \
 (n/5) (n/5) (n/5)
Now going further down to 3rd level, tree will look like
                          (n)                             level 1
           /               |               \
        (n/5)            (n/5)            (n/5)           level 2
       /  |  \          /  |  \          /  |  \
 (n/25)(n/25)(n/25) (n/25)(n/25)(n/25) (n/25)(n/25)(n/25) level 3
You go on like this until the bottom level, where each node contains a problem of size 1.
So, if you need to write the recursion just see level 1 and level 2.
Time taken to solve problem with 'n1' element is written as T(n1).
Time taken to solve problem with 'n2' element is written as T(n2).
Now the number of elements on level 1 is n1 = n, and:
(time to solve the problem on the first level) = (time to solve the 1st node of level 2) + (time to solve the 2nd node of level 2) + (time to solve the 3rd node of level 2) + (time to combine these three parts, which the question says is theta(n))
T(n)=T(n/5)+T(n/5)+T(n/5)+theta(n)
=>T(n)=3T(n/5)+theta(n)
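As for the runtime: since log base 5 of 3 is less than 1, the theta(n) combine term dominates and the master theorem gives T(n) = Theta(n). A quick numeric sketch (with the arbitrary base case T(1) = 1) shows the ratio T(n)/n settling near a constant, which is what Theta(n) predicts:

```python
from functools import lru_cache

# Expand T(n) = 3*T(n/5) + n numerically and watch T(n)/n settle.
# The geometric series n * (1 + 3/5 + (3/5)^2 + ...) suggests the
# ratio approaches 1 / (1 - 3/5) = 2.5.
@lru_cache(maxsize=None)
def T(n):
    if n <= 1:
        return 1
    return 3 * T(n // 5) + n

for k in range(1, 8):
    n = 5 ** k
    print(n, T(n) / n)   # the ratio creeps up toward 2.5
```

Because the ratio stays bounded by a constant, T(n) grows linearly, i.e. the runtime is Theta(n).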

How to represent molecules and compare equality

I've seen this question about the representation of molecules in memory, and it makes sense to me (tl;dr represent it as a graph with atoms as nodes and bonds as edges). But now my question is this: how do we check and see if two molecules are equal? This could be generalized as how can we check equality of (acyclic) graphs? For now we'll ignore stereoisomers and cyclical structures, such as the carbon ring in the example given in the first link.
Here's a more detailed description of my problem: For my Molecule class (as of now), I intend to have an array of Atoms and an array of Bonds. Each Bond will point to the two Atoms at either end, and will have a weight (i.e., the number of chemical bonds in that edge). In other words, this will most closely resemble an edge list graph. My first guess is to iterate over the Atoms in one molecule and try to find corresponding Atoms in the other molecule based on the Bonds that contain that Atom, but this is a rather naive approach, and the complexity seems pretty large (best guess is close to O(n!). Yikes.).
Regardless of complexity, this approach seems like it would work in most cases, however it seems to break down for some molecules. Take these for example (notice the different location of the OH group):
    H   H   H   OH  H
    |   |   |   |   |
H - C - C - C - C - C - H   (2-Pentanol)
    |   |   |   |   |
    H   H   H   H   H

    H   H   OH  H   H
    |   |   |   |   |
H - C - C - C - C - C - H   (3-Pentanol)
    |   |   |   |   |
    H   H   H   H   H
If we examine these molecules, for each atom in one molecule there is a unique same-element atom in the other molecule that has the same number and types of bonds, but these two molecules are clearly not the same, nor are they stereoisomers (which I'm not considering now). Instead they are structural isomers. Is there a way that we can check this relative structure as well? Would this be easier with an adjacency list instead of an edge list? Are there any graph equality algorithms out there that I should look into (ideally in Java)? I've looked a bit into graph canonization, but this seems like it could be NP-hard.
Edit: Looking at the Graph Isomorphism Problem Wikipedia Article, it seems as if graphs with bounded degree have polynomial time solutions to this problem. Furthermore, planar graphs also have polynomial solutions (i.e., the edges only intersect at their endpoints). It seems to me that molecules satisfy both of these conditions, so what is this polynomial-time solution to this problem, or where can I find it? My Google searches are letting me down this time.
If the graphs are acyclic, then it is a tree isomorphism problem, which has a pretty straightforward solution.
For now let's assume all internal nodes are carbon and all edges are the same (later we'll see how to relax this restriction).
Represent leaf nodes as numbers - say their atomic number. Represent trees of height 1 as sorted lists of their leaf nodes, so:
    H                  Cl
    |                  |
H - C - H   and   Cl - C - Cl
    |                  |
    H                  H
are [1,1,1,1] and [1,17,17,17] respectively. Obviously two molecules are isomorphic iff the sorted lists are the same.
This generalizes to trees of larger heights - represent a tree of height n as a list of representations of its subtrees, sorted lexicographically, so
    Cl  H                   H   H
    |   |                   |   |
H - C - C - Cl   and   Cl - C - C - Cl
    |   |                   |   |
    Cl  H                   H   Cl
are both [[1,1,17],[1,17,17]]. Two trees are isomorphic iff their representations are.
Note: usually the tree isomorphism algorithms work on rooted trees. Here we just go recursively from leaves towards the center of the graph which sometimes leaves us with two "roots".
    H   H   Cl
    |   |   |
H - C - C - C - H
    |   |   |
    H   H   H
Here, the left C is [1,1,1], the right C is [1,1,17]. The middle C (which is the root here) has these two lists plus two leaves. Sorted lexicographically it's [1,1,[1,1,1],[1,1,17]].
Now for representing internal nodes that aren't C - you can just simulate them by attaching a fake leaf with a special number, so
    H
    |
H - C - O - H
    |
    H
Can be encoded as
    H
    |
H - C - C - H
    |   |
    H   Fake
Where the "Fake" can be, say, 511 so that we know it doesn't clash with any existing atom. The whole molecule will thus be [[1,1,1],[1,511]].
So the algorithm is:
Convert both molecules to the recursively lexicographically sorted list form.
Check if the representations are equal.
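The encoding above fits in a few lines. This is a hedged sketch (the question asked ideally for Java, but Python keeps it short): subtrees are encoded as strings rather than nested lists so that sorting them lexicographically is well-defined, and the brute-force loop over roots stands in for a proper tree-centre computation.

```python
# Sketch of the recursive canonical encoding described above.
# `atoms` maps an atom index to its atomic number; `adj` maps an atom
# index to the list of its neighbours (the bond graph must be acyclic).
def canon(atoms, adj, root, parent=None):
    parts = sorted(canon(atoms, adj, v, root)
                   for v in adj[root] if v != parent)
    return "(%d:%s)" % (atoms[root], "".join(parts))

def molecule_key(atoms, adj):
    # For an unrooted tree, try every root and keep the smallest
    # encoding.  O(n^2) overall; rooting at the tree centre(s) would
    # reduce this to O(n).
    return min(canon(atoms, adj, r) for r in atoms)

# Heavy-atom skeletons of the two propanols (hydrogens omitted for
# brevity; including them works the same way).
atoms1 = {0: 6, 1: 6, 2: 6, 3: 8}
adj1 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # C-C-C-O (1-propanol)

atoms2 = {0: 6, 1: 6, 2: 6, 3: 8}
adj2 = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}      # C-C(-O)-C (2-propanol)

atoms3 = {0: 8, 1: 6, 2: 6, 3: 6}
adj3 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # O-C-C-C (1-propanol again)
```

Two acyclic molecules are then isomorphic iff their `molecule_key` values are equal: the structural isomers get different keys, while relabelling the atoms of the same molecule gives the same key.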
@Rafal discussed the case of trees. But what if you do not have trees? Here are my two cents:
Mathematica approach
Mathematica has a built-in predicate, IsomorphicGraphQ, to check whether two graphs are isomorphic. You can try Mathematica free for 30 days if you do not have it.
Check nauty
nauty is a solver you can download to test whether two graphs are isomorphic.
Detect true negatives in advance
You can detect true negatives in advance by simply computing and comparing some cheap invariants, such as the numbers of vertices and edges and the degree sequence. A pair of graphs passing these checks is not necessarily isomorphic, but the checks can prune your search space (maybe drastically!).
Most importantly, there is a result stating that isomorphism testing is polynomial for graphs of bounded treewidth. Even if your graphs seem general, they may exhibit this property (or you can simply assume it in practice).

Why doesn't the distribution of inversions matter in insertion sort?

According to Robert Sedgewick, shellsort (supposed to run faster than insertion sort) tries to reduce the number of inversions with its different h-sortings.
In a way, this h-sorting procedure makes the file nearly sorted, and hence rearranges the inversion distribution in a more symmetric way.
Then how can one say (according to the book) that insertion sort's running time depends on the number of inversions and not on how they are distributed?
In insertion sort, every swap that's made reduces the number of inversions by exactly one. Imagine, for example, that we're about to swap two adjacent elements B and A in insertion sort. Right before the swap, the array looks something like this:
+--------------+---+---+------------+
|    before    | B | A |   after    |
+--------------+---+---+------------+
And, right afterwards, it looks like this:
+--------------+---+---+------------+
|    before    | A | B |   after    |
+--------------+---+---+------------+
Now, think about the inversions in the array. Any inversion purely in "before" or "after" is still there. Every inversion from "before" into "after" is still there, as are inversions from "before" into A, "before" into B, A into "after," and B into "after." The only inversion that's gone is the specific inversion pair (A, B). Consequently, the number of swaps in insertion sort is exactly equal to the number of inversions, since each inversion requires one swap and the algorithm stops when no inversions are left. Notice that it's just the total number of inversions that matters, not where they are.
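That invariant (each adjacent swap removes exactly one inversion, so swaps = inversions regardless of where the inversions sit) is easy to check empirically. A small sketch:

```python
import itertools
import random

def insertion_sort_swaps(a):
    """Insertion sort; returns the number of adjacent swaps performed."""
    a, swaps = list(a), 0
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            swaps += 1
            j -= 1
    return swaps

def inversions(a):
    """Count pairs (i, j) with i < j and a[i] > a[j]."""
    return sum(1 for i, j in itertools.combinations(range(len(a)), 2)
               if a[i] > a[j])

# Random arrays have wildly different inversion layouts, yet the swap
# count always equals the inversion count.
random.seed(0)
for _ in range(100):
    a = [random.randrange(50) for _ in range(12)]
    assert insertion_sort_swaps(a) == inversions(a)
```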
On the other hand, this is not true about shellsort. Suppose in shellsort that we swap elements B and A, which are out of place but not adjacent. Schematically, right before the swap we have something like this:
+--------------+---+----------+---+------------+
|    before    | B |  middle  | A |   after    |
+--------------+---+----------+---+------------+
And we end with this:
+--------------+---+----------+---+------------+
|    before    | A |  middle  | B |   after    |
+--------------+---+----------+---+------------+
The inversion (B, A) is now gone, but it's also quite possible that even more inversions were eliminated with this step. For example, suppose there are a bunch of elements in "middle" that are less than B. That single swap would then eliminate all of them at the same time.
Because each swap in shellsort can potentially eliminate multiple inversions, the actual locations of those inversions do matter for the runtime, not just their count.
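A tiny numeric check of that claim, reusing an inversion counter:

```python
import itertools

def inversions(a):
    """Count pairs (i, j) with i < j and a[i] > a[j]."""
    return sum(1 for i, j in itertools.combinations(range(len(a)), 2)
               if a[i] > a[j])

# One long-distance (shellsort-style) swap can remove many inversions
# at once: swapping 5 and 1 here removes 7 of the 10 inversions in a
# single step, something an adjacent insertion-sort swap can never do.
a = [5, 4, 3, 2, 1]
before = inversions(a)       # 10
a[0], a[4] = a[4], a[0]      # -> [1, 4, 3, 2, 5]
after = inversions(a)        # 3
```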
Not an answer per se: shellsort actually does require fewer comparisons on average than insertion sort (and possibly all other sorting algorithms) PROVIDED you supply it with the correct gap sequence, which in turn is a function of n (the number of elements to be sorted). There is probably just one (unique) optimal gap sequence for every n.
Defining that gap sequence as a function of n is, of course, the tricky part!

How Balanced are balanced B-Trees

Say I have a B-Tree with nodes in a 3-4 configuration (3 elements and 4 pointers). Assuming I build up my tree legally according to the rules, is it possible for me to reach a situation where there are two nodes in a layer and one node has 4 exiting pointers and the other has only two exiting pointers?
In general, what guarantees do I have as to the balancedness of a properly used B-tree?
The idea behind balance (in general balanced tree data structures) is that the difference in depths between any two sub-trees is zero or one (depending on the tree). In other words, the number of comparisons used to find a leaf node is always similar.
So yes, you can end up in the situation you describe, simply because the depths are the same. The number of elements in each node is not a concern to the balance (but see below).
This is perfectly legal even though there's more items in the left node than the right (null pointers are not shown):
        +---+---+---+
        | 8 |   |   |
        +---+---+---+
       /      |
      /       |
     |        |
+---+---+---+   +---+---+---+
| 1 | 2 | 3 |   | 9 |   |   |
+---+---+---+   +---+---+---+
However, it's very unusual to have a 3-4 BTree (some would actually say that's not a BTree at all, but some other data structure).
With BTrees, you usually have an even number of keys as maximum in each node (e.g., a 4-5 tree) so that the splitting and combining is easier. With a 4-5 tree, the decision as to which key is promoted when a node fills up is easy - it's the middle one of the five. That's not such a clear-cut matter with a 3-4 tree since it could be one of two (there is no definite middle for four elements).
It also allows you to follow the rule that your nodes should contain between n and 2n elements. In addition (for "proper" BTrees), the leaf nodes are all at the same depth, not just within one of each other.
If you added the following values to an empty BTree, you could end up with the situation you describe:
Add   Tree Structure
---   --------------
 1    1
 2    1,2
 5    1,2,5
 6    1,2,5,6
 7         5
          / \
       1,2   6,7
 8         5
          / \
       1,2   6,7,8
 9         5
          / \
       1,2   6,7,8,9
