Finding the most frequent subtrees in a collection of (parse) trees - algorithm

I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.
EDIT @mjv: We are working in a limited domain (chemistry) which has a stylised language, so the variety of the trees is not huge - probably similar to children's readers. Here is a simple tree for "the cat sat on the mat":
<sentence>
  <nounPhrase>
    <article/>
    <noun/>
  </nounPhrase>
  <verbPhrase>
    <verb/>
    <prepositionPhrase>
      <preposition/>
      <nounPhrase>
        <article/>
        <noun/>
      </nounPhrase>
    </prepositionPhrase>
  </verbPhrase>
</sentence>
Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat", "mat" are not important in matching), so the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:
<nounPhrase>
  <article/>
  <adjective/>
  <adjective/>
  <noun/>
</nounPhrase>
The sentences will be longer - between 15 and 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so, that's acceptable.
Obviously the shorter the tree, the more frequent it will be, so nounPhrase will be very common.
EDIT: If this is to be solved by flattening the trees, then I think it is related to Longest Common Substring, not Longest Common Subsequence. But note that I don't necessarily want just the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).

To find the most frequent subtrees in the collection, create a compact canonical form of each subtree, then iterate over every subtree and use a hash table to count occurrences. 30 nodes is too big for a perfect hash - that allows only about one bit per node, and you need at least that much just to indicate whether the next node is a sibling or a child.
This problem isn't LCS - the most frequent subtree has nothing to do with the longest common subsequence. The most frequent subtree is simply the one that occurs the most.
It should be at worst O(N L^2) for N trees of L nodes each (assuming that testing equality of a subtree containing L nodes is O(L)).
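Here is a minimal Java sketch of that approach (the Node class and serialization format are assumptions, not from the question): it serializes the complete subtree rooted at each node into a canonical string, preserving child order, and counts the strings in a HashMap.

import java.util.*;

final class Node {
    String label;                       // e.g. "nounPhrase"
    List<Node> children = new ArrayList<>();
    Node(String label, Node... kids) {
        this.label = label;
        children.addAll(Arrays.asList(kids));
    }
}

public class SubtreeCounter {
    // Post-order walk: build each node's canonical form from its children's
    // forms (child order preserved, as the question requires) and count it.
    static String count(Node n, Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder("(").append(n.label);
        for (Node c : n.children) sb.append(count(c, counts));
        String form = sb.append(')').toString();
        counts.merge(form, 1, Integer::sum);
        return form;
    }

    public static void main(String[] args) {
        // "the cat sat on the mat", as in the question's XML
        Node sentence = new Node("sentence",
            new Node("nounPhrase", new Node("article"), new Node("noun")),
            new Node("verbPhrase", new Node("verb"),
                new Node("prepositionPhrase", new Node("preposition"),
                    new Node("nounPhrase", new Node("article"), new Node("noun")))));
        Map<String, Integer> counts = new HashMap<>();
        count(sentence, counts);
        counts.forEach((form, c) -> System.out.println(c + "  " + form));
        // (nounPhrase(article)(noun)) is counted twice
    }
}

Note that this counts only the complete subtree below each node; mining partially matching embedded subtrees is what tools like FREQT (referenced below) are for.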

Although you say that performance isn't yet an issue, I think this is an NP-hard problem, so it may never be possible to make it fast. If I've understood correctly, you can consider this a variant of the Longest Common Subsequence problem; if you flatten your tree into a straight sequence like
(nounphrase)(DOWN)(article:the)(adjective:big)(adjective:black)(noun:cat)(UP)
Then your problem becomes LCS.
Wikibooks has a Java implementation of LCS here.
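For reference, a hedged sketch of the flattening step (the Node class and marker names are assumptions): a pre-order walk that emits labels plus DOWN/UP markers, after which any LCS routine such as the Wikibooks one can run on the token sequences.

import java.util.*;

public class Flattener {
    static final class Node {
        String label;
        List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    // Pre-order walk emitting (label)(DOWN)...(UP) as described above.
    static void flatten(Node n, List<String> out) {
        out.add(n.label);
        if (!n.children.isEmpty()) {
            out.add("DOWN");
            for (Node c : n.children) flatten(c, out);
            out.add("UP");
        }
    }
}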

This is a well-known problem in computer science, for which there are efficient solutions.
Here are some relevant references:
Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Optimized Substructure Discovery for Semi-structured Data, Proc. 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2002), LNAI 2431, Springer-Verlag, 1-14, August 2002.
Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.
Or, if you just want fast code, go here:
FREQT
(Transforming the XML into S-expressions shouldn't give you too many problems, and is left as an exercise for the reader.)
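If you want a head start on that exercise, here is a hedged sketch using the JDK's DOM parser (the file name and the exact S-expression dialect the tool expects are assumptions):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlToSexp {
    // Recursively print an element and its child elements as (tag child...).
    static String toSexp(Element e) {
        StringBuilder sb = new StringBuilder("(").append(e.getTagName());
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            if (k.getNodeType() == Node.ELEMENT_NODE)
                sb.append(' ').append(toSexp((Element) k));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new java.io.File("sentence.xml"));
        System.out.println(toSexp(doc.getDocumentElement()));
        // e.g. (sentence (nounPhrase (article) (noun)) (verbPhrase ...))
    }
}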

I found a tool called gSpan very useful in this case. It's available for free download at http://www.cs.ucsb.edu/~xyan/software/gSpan.htm . A C++ version with a MATLAB interface is at http://www.nowozin.net/sebastian/gboost/

Related

Counting isomeric n-carbon aliphatic alkanes

An n-carbon aliphatic alkane is an unrooted tree consisting of n nodes where the degree of each node is at most 4. As an example, see this for an enumeration of some low values of n.
I am looking for an algorithm to compute the number of such n-carbon aliphatic alkanes, given an n.
I have seen this on Chemistry Stack Exchange already. I have also thought of dynamic programming, i.e., building larger graphs from smaller components, but I cannot deal with overcounting the same isomers.
Clarification: the carbons are just a metaphor. I do not wish to take into account the instability of C16 and C17, nor do I care about stereoisomers.
So the standard approach is to use the Redfield–Pólya Theorem, also known as the Pólya enumeration theorem. However, it is not very 'algorithmic' - you get code like this (the Mathematica, Haskell, or one of the Python versions).
The Rosetta Code page also describes a more direct approach using canonical checking to avoid duplicates. The algorithm is a specialised form of orderly generation (I think) that only works for trees without vertex or edge colours and with a maximum valence of 4.
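To illustrate the canonical-checking idea (this is a naive sketch, not the Rosetta Code algorithm): grow trees one leaf at a time and deduplicate with a canonical AHU encoding rooted at the tree's centre. It is exponential, but it reproduces the alkane counts (1, 1, 1, 2, 3, 5, 9, 18, 35, 75, ...) for small n.

import java.util.*;

public class AlkaneCounter {
    public static void main(String[] args) {
        List<List<List<Integer>>> current = new ArrayList<>();
        List<List<Integer>> single = new ArrayList<>();
        single.add(new ArrayList<>());          // the 1-node tree (methane)
        current.add(single);
        System.out.println("n=1: 1");
        for (int n = 2; n <= 10; n++) {
            Map<String, List<List<Integer>>> next = new HashMap<>();
            for (List<List<Integer>> tree : current) {
                for (int v = 0; v < tree.size(); v++) {
                    if (tree.get(v).size() >= 4) continue;   // valence at most 4
                    List<List<Integer>> t = copy(tree);
                    t.add(new ArrayList<>());                // attach a new leaf w
                    int w = t.size() - 1;
                    t.get(v).add(w);
                    t.get(w).add(v);
                    next.putIfAbsent(canonical(t), t);       // canonical check
                }
            }
            current = new ArrayList<>(next.values());
            System.out.println("n=" + n + ": " + current.size());
        }
    }

    static List<List<Integer>> copy(List<List<Integer>> t) {
        List<List<Integer>> c = new ArrayList<>();
        for (List<Integer> adj : t) c.add(new ArrayList<>(adj));
        return c;
    }

    // Strip leaves until only the centre(s) remain, then take the smallest
    // AHU encoding over the (at most two) centres.
    static String canonical(List<List<Integer>> t) {
        int n = t.size(), remaining = n;
        int[] deg = new int[n];
        boolean[] removed = new boolean[n];
        List<Integer> leaves = new ArrayList<>();
        for (int v = 0; v < n; v++)
            if ((deg[v] = t.get(v).size()) <= 1) leaves.add(v);
        while (remaining > 2) {
            List<Integer> nextLeaves = new ArrayList<>();
            for (int v : leaves) {
                removed[v] = true;
                remaining--;
                for (int u : t.get(v))
                    if (!removed[u] && --deg[u] == 1) nextLeaves.add(u);
            }
            leaves = nextLeaves;
        }
        String best = null;
        for (int v = 0; v < n; v++) {
            if (removed[v]) continue;
            String s = encode(t, v, -1);
            if (best == null || s.compareTo(best) < 0) best = s;
        }
        return best;
    }

    // AHU encoding: sorted child encodings wrapped in parentheses.
    static String encode(List<List<Integer>> t, int v, int parent) {
        List<String> kids = new ArrayList<>();
        for (int u : t.get(v)) if (u != parent) kids.add(encode(t, u, v));
        Collections.sort(kids);
        return "(" + String.join("", kids) + ")";
    }
}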

Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted average of the entropies of the potential left and right branches, and the feature + splitting feature_value that gives the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
sort the m distinct feature_values by the percentage of 1's of the samples within the node that takes that feature_value for that feature.
Only try the m-1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by definition of 'shortcut') implies that the two methods, despite differing drastically in runtime, give exactly the same results.
The quote: "For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list."
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
The answer is simple: the two procedures just aren't the same. As you noticed, splitting in the exact way is an NP-hard problem and thus hardly feasible for any problem in practice. Moreover, due to overfitting, that would usually not be the optimal result in terms of generalization.
Instead, the exhaustive search is replaced by a greedy procedure: sort first, then try all ordered splits. In general this leads to different results than exact splitting.
To improve on the greedy result, one often also applies pruning (which can be seen as another greedy heuristic). And newer methods like random forests or BART deal with this problem effectively by averaging over several trees, so that the deviation of a single tree becomes less important.
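For concreteness, here is a hedged Java sketch of the greedy ordered-splits procedure for a binary target (all names are assumptions; entropy is used as the impurity): sort the categories by their class-1 proportion, then evaluate only the m-1 prefix splits of that ordering instead of (2^m - 2)/2 subsets.

import java.util.*;

public class OrderedSplits {
    // counts[c][0] = class-0 count, counts[c][1] = class-1 count for category c.
    // Returns k such that the sorted categories order[0..k] go left.
    static int bestPrefixSplit(int[][] counts) {
        Integer[] order = new Integer[counts.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort categories by proportion of class 1 within the category.
        Arrays.sort(order, Comparator.comparingDouble(
            c -> counts[c][1] / (double) (counts[c][0] + counts[c][1])));
        double total0 = 0, total1 = 0;
        for (int[] c : counts) { total0 += c[0]; total1 += c[1]; }
        double best = Double.MAX_VALUE;   // minimise weighted child entropy
        int bestK = -1;
        double l0 = 0, l1 = 0;
        for (int k = 0; k < order.length - 1; k++) {   // the m-1 prefix splits
            l0 += counts[order[k]][0];
            l1 += counts[order[k]][1];
            double r0 = total0 - l0, r1 = total1 - l1, n = total0 + total1;
            double w = ((l0 + l1) / n) * entropy(l0, l1)
                     + ((r0 + r1) / n) * entropy(r0, r1);
            if (w < best) { best = w; bestK = k; }
        }
        return bestK;
    }

    static double entropy(double n0, double n1) {
        double n = n0 + n1, e = 0;
        if (n == 0) return 0;
        for (double ni : new double[]{n0, n1})
            if (ni > 0) e -= (ni / n) * (Math.log(ni / n) / Math.log(2));
        return e;
    }
}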

Confusion regarding PATRICIA [closed]

According to points 3 and 4 of the libstdc++ documentation, PATRICIA tries have two types of nodes:
A (PATRICIA) trie is similar to a tree, but with the following differences:
1. It explicitly views keys as a sequence of elements. E.g., a trie can view a string as a sequence of characters; a trie can view a number as a sequence of bits.
2. It is not (necessarily) binary. Each node has fan-out n + 1, where n is the number of distinct elements.
3. It stores values only at leaf nodes.
4. Internal nodes have the properties that A) each has at least two children, and B) each shares the same prefix with any of its descendant.
The book I've been reading (Algorithms in C, Parts 1-4 by Robert Sedgewick) seems to describe a PATRICIA trie storing n values with only n nodes, using internal nodes to store values:
Like DSTs, patricia tries allow search for N keys in a tree with just N nodes. ... we avoid external nodes via another simple device: we store data in internal nodes and replace links to external nodes with links that point back upwards to the correct internal node in the trie
It seems there are two camps of belief here:
On the one hand we have a strict, specific definition (i.e. Sedgewick, Knuth, Morrison who all seem to describe PATRICIA exclusively as a prefix-compressed binary tree with one-way branching eliminated); and
Then we have those believing the term forms a loose, vague definition, which seems more like they meant to use a word like "map", "dictionary" or "trie" (which are all actually loosely defined, e.g. the libstdc++ documentation).
I guess I'm concerned about the accuracy of my resources. As I understand it, due to problems introduced by common prefixes, it isn't possible to represent a tree with just N nodes without presenting it as a binary tree (which seems to violate point 2 of the libstdc++ docs, and point 4 when dealing with variable-width keys), and without losing the notion of strict one-way branching (violating points 3 and 4 by rendering the concepts of "leaf nodes" and "children" somewhat invalid). The two features work in tandem to eliminate the dilemma of "internal nodes" that would cause such trees to use more than N nodes (recall: N items with just N nodes).
These two groups of references can't both be correct; there's too much mutual exclusion. Where one reference says PATRICIA is binary and another says it might not be, they can't both be considered factually correct, and that's just one example of inconsistency I see here. Which of these references are correct?
I continued to search for a specific definition from past reputable sources to confirm what I had suspected, and I'm writing to provide my findings. Perhaps the most significant is the official paper defining PATRICIA, published by D. R. Morrison in the October 1968 Journal of the ACM:
PATRICIA evolved from "A Library Automaton" [3] and other studies. ... Early in this evolution it was decided that the alphabet should be restricted to a binary one. A theorem which strongly influenced this decision is one which, in another form, is due to Euler. The theorem states that if the alphabet is binary, then the number of branches is exactly one less than the number of ends. Corollaries state that as the library grows, each new end brings into the library with it exactly one new branch, and each branch has exactly two exits. These facts are very useful in the allocation of storage for the index. They imply that the total storage required is completely determined by the number of ends, and all of the storage required will actually be used.
This certainly contradicts points 2 and 3 of the libstdc++ reference. There's further evidence in this paper, such as specific algorithm details, but the quote above should suffice.
There don't appear to be any deviations from the official description in the Sedgewick quote, however. Based on that, the libstdc++ resource is certainly less valid than the Sedgewick resource.
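To make the "N keys with just N nodes" picture concrete, here is a hedged sketch of a binary PATRICIA search in Java (not Sedgewick's actual code; names are mine): every node holds a key, and links that would point to external nodes instead point back up the trie, so a descent ends when the bit index stops increasing.

public class PatriciaSketch {
    static final class Node {
        String key;          // data stored in the (internal) node itself
        int bit;             // index of the bit this node inspects
        Node left, right;    // downward links or upward back-links
    }

    // head is a header node with bit = -1 whose left link enters the trie.
    // Follow links while the bit index strictly increases; following a
    // back-link ends the descent with a single candidate key to compare.
    static String search(Node head, String key) {
        Node p = head, next = head.left;
        while (next.bit > p.bit) {
            p = next;
            next = bit(key, next.bit) ? next.right : next.left;
        }
        return key.equals(next.key) ? next.key : null;
    }

    // Treat the key as a bit sequence (assumption: 8-bit chars, MSB first).
    static boolean bit(String key, int i) {
        int c = i / 8;
        return c < key.length() && ((key.charAt(c) >> (7 - i % 8)) & 1) != 0;
    }
}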
Although both definitions seem to be correct, the first one is more detailed and seems better to me. Also have a look at this answer, where I try to depict the difference between a PATRICIA trie and a regular trie.

What invariant do RRB-trees maintain?

Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into the node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, and I think perhaps I'm not sure what each of the variables mean (in particular "a total of p sub-tree branches").
Also: how does the RRB-tree concatenation algorithm work?
They do describe an invariant in section 2.4: "However, as mentioned earlier B-Trees nodes do not facilitate radix searching. Instead we chose the initial invariant of allowing the node sizes to range between m and m - 1. This defines a family of balanced trees starting with well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This invariant ensures balancing and achieves radix branch search in the majority of cases. Occasionally a few step linear search is needed after the radix search to find the correct branch. The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.
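As a hedged illustration of that payoff (field names are assumptions, not the paper's), indexing in a relaxed node is one radix guess plus the "few step linear search" over the node's cumulative size table:

final class RrbNode {
    int[] sizes;      // sizes[k] = total number of elements in slots 0..k
    Object[] slots;   // child sub-trees or leaves

    // The radix guess is exact when all slots are full; for relaxed nodes
    // it is a lower bound, so scan forward a few steps at most.
    int findSlot(int index, int shift) {
        int slot = index >>> shift;
        while (sizes[slot] <= index) slot++;
        return slot;
    }
}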
@mcdowella is correct that that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and (2m) - 1, because this allows nodes to be split in two when they get too big, or two nodes to be joined into one when they are too small, without ever needing to change a third node. So it's a typo that the "2" is missing from "2m" in the paper. Jean Niklas L'orange's master's thesis backs me up on this.
Furthermore, all strict nodes have the same length, which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes left (more on this later) so you don't have to guess which branch of the tree to descend. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing two strict levels if you repeatedly insert at index 0.
Their tree also has an even height - all leaf nodes are an equal distance from the root. I think it would still work if relaxed trees had to be within, say, one level of one another, though I'm not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector, because the RRB Tree has one inside it. The two differences between that and the RRB Tree are: 1. the RRB Tree allows "relaxed" nodes, and 2. the RRB Tree may have a "focus" instead of a "tail". Both focus and tail are small buffers (maybe the same size as a strict leaf node); the difference is that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PersistentVector can only be appended to, never inserted into). These two differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.

really hard to understand suffix tree

I've been searching for tutorials about suffix trees for quite a while. On SO, I found two posts about understanding suffix trees: 1, 2.
But I can't say that I understand how to build one. In Skiena's book The Algorithm Design Manual, he says:
Since linear time suffix tree construction algorithms are nontrivial, I recommend using an existing implementation.
Well, is the on-line construction algorithm for suffix trees really so hard? Can anybody point me in the right direction to understand it?
Anyway, to cut to the chase: besides the construction, there is one more thing I don't understand about suffix trees. Because the edges in a suffix tree are just pairs of integers (right?) specifying the starting and ending positions of the substring, if I want to search for a string x in this suffix tree, how should I do it? Dereference those integers in the suffix tree, then compare them one by one with x? Surely it can't work that way.
Firstly, there are many ways to construct a suffix tree. There is the original O(n) method by Weiner (1973), the improved one by McCreight (1976), the most well-known by Ukkonen (1991/1992), and a number of further improvements, largely related to implementation and storage efficiency considerations. Most notable among those is perhaps the Efficient implementation of lazy suffix trees by Giegerich and Kurtz.
Moreover, since the direct construction of suffix arrays has become possible in O(n) time in 2003 (e.g. using the Skew algorithm, but there are others as well), and since there are well-studied methods for
emulating suffix trees using suffix arrays (e.g. Abouelhoda/Kurtz 2004)
compressing suffix arrays (see Navarro/Mäkinen 2007 for a survey)
suffix arrays are usually preferred over suffix trees. Therefore, if your intention is to build a highly optimised implementation for a specific purpose, you might want to look into studying suffix array construction algorithms.
However, if your interest is in suffix tree construction, and in particular the Ukkonen algorithm, I would like to suggest that you take a close look at the description in this SO post, which you mentioned already, and we try to improve that description together. It's certainly far from a perfectly intuitive explanation.
To answer the question about how to compare input string to edge labels: For efficiency reasons during construction and look-up, the initial character of every edge label is usually stored in the node. But the rest must be looked up in the main text string, just like you said, and indeed this can cause issues, in particular when the string is so large that it cannot readily be held in memory. That (plus the fact that, like any direct implementation of a tree, the suffix tree is a data structure that contains a lot of pointers, which consume much memory and make it hard to maintain locality of reference and to benefit from memory caching) is one of the main reasons why suffix trees are so much harder to handle than e.g. inverted indexes.
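Here is a hedged sketch of that representation (names are mine): each edge stores (start, end) indices into the text plus its first character as a map key, and matching dereferences back into the text exactly as the question suspects.

import java.util.*;

public class SuffixTreeSketch {
    static final class Node {
        Map<Character, Edge> edges = new HashMap<>(); // keyed by first label char
    }
    static final class Edge {
        int start, end;                // label = text.substring(start, end)
        Node child = new Node();
        Edge(int start, int end) { this.start = start; this.end = end; }
    }

    // Does pattern occur in text? Walk edges, comparing label characters
    // looked up in the main text string.
    static boolean contains(String text, Node root, String pattern) {
        Node node = root;
        int i = 0;
        while (i < pattern.length()) {
            Edge e = node.edges.get(pattern.charAt(i)); // O(1) via stored first char
            if (e == null) return false;
            for (int j = e.start; j < e.end && i < pattern.length(); j++, i++)
                if (text.charAt(j) != pattern.charAt(i)) return false;
            node = e.child;
        }
        return true;
    }
}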
If you combine the suffix array with an lcp table and a child table, which of course you should do, you essentially get a suffix tree. This point is made in the paper: Linearized Suffix Trees by Kim, Park and Kim. The lcp table enables a rather awkward bottom-up traversal, and the child table enables an easy traversal of either kind. So the story about suffix trees using pointers causing locality of reference problems is in my opinion obsolete information. The suffix tree is therefore "the right and easy way to go," as long as you implement the tree using an underlying suffix array.
The paper by Kim, Park and Kim describes a variant of the approach in Abouelhoda et al.'s misleadingly titled paper Replacing suffix trees with enhanced suffix arrays. The Kim et al. paper gets it right that this is an implementation of suffix trees, not a replacement. Moreover, the details of Abouelhoda et al.'s construction are described more simply and intuitively in Kim et al.
There's an implementation of Ukkonen's linear construction of suffix trees (plus suffix arrays and the LCP array) here: http://code.google.com/p/text-indexing/ . The visualization provided along with suffixtree.js may help.
