What are the differences between B-tree and B*-tree, except the requirement for fullness? - algorithm

I know about this question, but it's about B-tree and B+-tree. Sorry, if there's similar for B*-tree, but I couldn't find such.
So, what is the difference between these two trees? The wikipedia article about B*-trees is very short.
The only difference, that is noted there, is "non-root nodes to be at least 2/3 full instead of 1/2". But I guess there's something more.. There could be just one kind of tree - the B-tree, just with different constants (for the fullness of each non-root node), and no two different trees, if this was the only difference, right?
Also, one more thing, that made me thing about more differences:
"A B*-tree should not be confused with a B+ tree, which is one where the
leaf nodes of the tree are chained together in the form of a linked list"
So, B+-tree has something really specific - the linked list. What is the specific characteristic of B*-tree, or there isn't such?
Also, there are no any external links/references in the wikipedia's article. Are there any resources at all? Articles, tutorials, anything?
Thanks!

Edited
No difference other than min. fill factor.
Page #489
From the above book,
B*-tree is a variant of a B-tree that requires each internal node to be at least 2/3 full, rather than at least half full.
Knuth also defines the B* tree exactly like that (The art of computer programming, Vol. 3).
"The Ubiquitous B-Tree" has a whole sub-section on B*-trees. Here, Comer defines the B*-tree tree exactly as Knuth and Corment et al. do but also clarifies where the confusion comes from --B*-tree tree search algorithms and some unnamed B tree variants designed by Knuth which are now called B+-trees.

Maybe you should look at Ubiquitous B-Tree by Comer (ACM Computing Surveys, 1979).
Comer writes there something about the B*Tree (In the section B-Tree and its variants). And in that section, he also cites some more paper about that topic. That should help you to do further investigations on your own :)! (I'm not your researcher ;) )
However, I don't understand the point where you cite a part which says that the B*Tree does not have a linked list in the leaf node level. I'm pretty sure, that also those nodes are linked together.
Regarding having only one B-Tree. Actually, you have that. The other ones like B+Tree, Prefix B+Tree and so on are just variants of the standard B-Tree. Just look at the paper Ubiquitous B-Tree.

Related

Decision Tree clarification

I just want to ask/clarify if decision trees are essentially binary trees where each node is a boolean, and it continues down until a desired result is reached?
Not necessarily. Some nodes may share children which is not the case in Binary trees. However, the essence of the decision tree is what you mentioned.
It's a tree where based on the probability of an outcome you move down the graph until you hit an outcome.
See Wikipedia's page on desicion trees for more info.
As mentioned by Ares, not all decision trees are binary (they can be "n-ary") although most implementation I have seen are binary trees.
For instance if you have a color variable (i.e. categorical) that can take three values : red, blue or green; you might want to split in three directly at a node instead of splitting in two and then again in two (or more).
The choice between binary and "n-ary" will usually depends on your data. I suspect that most people use binary trees anyway because it is relatively easier to implement and more flexible.
Then as you said the tree is developed until the desired outcome is reached. Decision Tree suffers major drawbacks such as overfitting and there exist many different ways to tackle this issue (pruning, boosting, etc.) but this is beyond the scope of the question/answer.
I recommend to have a look at this great visualization that explains well the decision tree : http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Will be happy to give more details about decision tree

Data structure: a graph that's similar to a tree - but not a tree

I have implemented a data structure in C, based upon a series of linked lists, that appears to be similar to a tree - but not enough to be referred as such, because in theory it allows the existence of cycles. Here's a basic outline of the nodes:
There is a single, identifiable root that doesn't have a parent node or brothers;
Each node contains a pointer to its "father", its nearest "brother" and the first of his "children";
There are "outer" nodes without children and brothers.
How can I name such a data structure? It cannot be a tree, because even if the pointers are clearly labelled and used differently, cycles like father->child->brother->father may very well exist. My question is: terms such as "father", "children" and "brother" can be used in the context of a graph or they are only reserved for trees? After quite a bit of research I'm still unable to clarify this matter.
Thanks in advance!
I'd say you can still call it a tree, because the foundation is a tree data structure. There is precedence for my claim: "Patricia Tries" are referred to as trees even though their leaf nodes may point up the tree (creating cycles). I'm sure there are other examples as well.
It sounds like the additional links you have are essentially just for convenience, and could be determined implicitly (rather than stored explicitly). By storing them explicitly you impose additional invariants on your tree operations (insert, delete, etc), but do not affect the underlying organization, which is a tree.
Precisely because you are naming and treating those additional links separately, they can be thought of as an "overlay" on top of your tree.
In the end it doesn't really matter what you call it or what category it falls into if it works for you. Reading a book like Sedgewick's "Algorithms in C" you realize that there are tons of data structures out there, and nothing wrong with inventing your own!
One more point: Trees are a special case of graphs, so there is nothing wrong with referring to it as a graph (or using graph algorithms on it) as well.

really hard to understand suffix tree

I've been searching for tutorials about suffix tree for quite a while. In SO, I found 2 posts about understanding suffix tree: 1, 2.
But I can't say that I understand how to build one, Oops. In Skiena's book Algorithm design manual, he says:
Since linear time suffix tree construction algorithms are nontrivial,
I recommend using an existing implementation.
Well, is the on-line construction algorithm for suffix tree so hard? Anybody can put me in the right direction to understand it?
Anyway, cut to the chase, besides the construction, there is one more thing I don't understand about suffix tree. Because the edges in suffix tree are just a pair of integers (right?) specifying the starting and ending pos of the substring, then if I want to search a string x in this suffix tree, how should I do it? De-reference those integers in the suffix tree, then compare them one by one with x? Couldn't be this way.
Firstly, there are many ways to construct a suffix tree. There is the original O(n) method by Weiner (1973), the improved one by McCreight (1976), the most well-known by Ukkonen (1991/1992), and a number of further improvements, largely related to implementation and storage efficiency considerations. Most notable among those is perhaps the Efficient implementation of lazy suffix trees by Giegerich and Kurtz.
Moreover, since the direct construction of suffix arrays has become possible in O(n) time in 2003 (e.g. using the Skew algorithm, but there are others as well), and since there are well-studied methods for
emulating suffix trees using suffix arrays (e.g. Abouelhoda/Kurtz 2004)
compressing suffix arrays (see Navarro/Mäkinen 2007 for a survey)
suffix arrays are usually preferred over suffix trees. Therefore, if your intention is to build a highly optimised implementation for a specific purpose, you might want to look into studying suffix array construction algorithms.
However, if your interest is in suffix tree construction, and in particular the Ukkonen algorithm, I would like to suggest that you take a close look at the description in this SO post, which you mentioned already, and we try to improve that description together. It's certainly far from a perfectly intuitive explanation.
To answer the question about how to compare input string to edge labels: For efficiency reasons during construction and look-up, the initial character of every edge label is usually stored in the node. But the rest must be looked up in the main text string, just like you said, and indeed this can cause issues, in particular when the string is so large that it cannot readily be held in memory. That (plus the fact that, like any direct implementation of a tree, the suffix tree is a data structure that contains a lot of pointers, which consume much memory and make it hard to maintain locality of reference and to benefit from memory caching) is one of the main reasons why suffix trees are so much harder to handle than e.g. inverted indexes.
If you combine the suffix array with an lcp table and a child table, which of course you should do, you essentially get a suffix tree. This point is made in the paper: Linearized Suffix Trees by Kim, Park and Kim. The lcp table enables a rather awkward bottom-up traversal, and the child table enables an easy traversal of either kind. So the story about suffix trees using pointers causing locality of reference problems is in my opinion obsolete information. The suffix tree is therefore``the right and easy way to go,'' as long as you implement the tree using an underlying suffix array.
The paper by Kim, Park and Kim describes a variant of the approach in Abouelhoda et al's misleadingly titled paper: Replacing suffix trees with enhanced suffix arrays. The Kim et al paper get it right that this is an implementation of suffix trees, and not a replacement. Moreover, the details of Abouelhoda et al's construction are more simply and intuitively described in Kim et al.
,
there's an implementation of Ukkonen's linear construction of suffix trees (plus suffix arrays, lcp array) here: http://code.google.com/p/text-indexing/ . the visualization provided along with the suffixtree.js may help

How do you remove an element from a b-tree?

I'm trying to learn about b-tree and every source I can find seem to omits the discussion about how to remove an element from the tree while preserving the b-tree properties.
Can someone explain the algorithm or point me to resource that do explain how it's done?
There's an explanation of it on the Wikipedia page. B-tree - Deletion
If you haven't got it yet, I strongly recommend Carmen & al Introduction to Algorithms 3rd Edition.
It is not described because the operations naturally stem from the B-Tree properties.
Since you have a lower-bound on the number of elements in a node, if removing your elements violates this invariant, then you need to restore it, which generally involves merging with a neighbour (or stealing some of its elements).
If you merge with a neighbour, then you need to remove an element in the parent node, which triggers the same algorithm. And you apply recursively till you get to the top.
B-Tree don't have rebalancing (at least not those I saw) so it's far less complicated that maintaining a red-black tree or an AVL tree which is probably why people didn't feel compelled to write about the removal.
About which b-trees are you talking about? With linked leaves or not? Also, there are different ways of removing an item (top-bottom, bottom-top, etc.). This paper might help: B-trees, Shadowing, and Clones (even though there are many file-system specific related stuff).
The deletion example from CLRS (2nd edition) is available here: http://ysangkok.github.io/js-clrs-btree/btree.html
Press "Init book" and then push the deletion buttons in order. That will cover all cases. Try and predict the new tree state before pushing each button, and try to recognize how the cases are all unique.

Finding the most frequent subtrees in a collection of (parse) trees

I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.
EDIT #mjv. We are working in a limited domain (chemistry) which has a stylised language so the varirty of the trees is not huge - probably similar to children's readers. Simple tree for "the cat sat on the mat".
<sentence>
<nounPhrase>
<article/>
<noun/>
</nounPhrase>
<verbPhrase>
<verb/>
<prepositionPhrase>
<preposition/>
<nounPhrase>
<article/>
<noun/>
</nounPhrase>
</prepositionPhrase>
</verbPhrase>
</sentence>
Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat". "mat" are not important in matching). So the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:
<nounPhrase>
<article/>
<adjective/>
<adjective/>
<noun/>
</nounPhrase>
The length of sentences will be longer - between 15 to 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so that's acceptable.
Obviously the shorter the tree the more frequent, so nounPhrase will be very common.
EDIT If this is to be solved by flattening the tree then I think it would be related to Longest Common Substring, not Longest Common Sequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).
Finding the most frequent subtrees in the collection, create a compact form of the subtree, then iterate every subtree and use a hashset to count their occurrences. 30 nodes is too big for a perfect hash - it's only about one bit per node, and you need that much to indicate whether it's a sibling or a child.
That problem isn't LCS - the most common sequence isn't related to the longest common subsequence. The most frequent subtree is that which occurs the most.
It should be at worst case O(N L^2) for N trees of length L (assuming testing equality of a subtree containing L nodes is O(L)).
I think, although you say that performance isn't yet an issue, this is an NP-hard problem, so it may never be possible to make it fast. If I've understood correctly, you can consider this a variant of the Longest common subsequence problem; if you flatten your tree into a straight sequence like
(nounphrase)(DOWN)(article:the)(adjective:big)(adjective:black)(noun:cat)(UP)
Then your problem becomes LCS.
Wikibooks has a java implementation of LCS here
This is a well-known problem in computer science, for which there are efficient solutions.
Here are some relevant references:
Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Optimized Substructure Discovery for Semi-structured Data, Proc. 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2002), LNAI 2431, Springer-Verlag, 1-14, August 2002.
Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.
Or, if you just want fast code, go here:
FREQT
(transforming xml to S-expressions shouldn't give you too much problems, and is left as an exercise for the reader)
I found tool called gspan very useful in this case. Its available for free download at http://www.cs.ucsb.edu/~xyan/software/gSpan.htm . Its c++ version with matlab interface is at http://www.nowozin.net/sebastian/gboost/

Resources