Confusion regarding PATRICIA [closed] - data-structures

According to points 3 and 4 of the libstdc++ documentation, PATRICIA tries have two types of nodes:
A (PATRICIA) trie is similar to a tree, but with the following differences:
1. It explicitly views keys as a sequence of elements. E.g., a trie can view a string as a sequence of characters; a trie can view a number as a sequence of bits.
2. It is not (necessarily) binary. Each node has fan-out n + 1, where n is the number of distinct elements.
3. It stores values only at leaf nodes.
4. Internal nodes have the properties that A) each has at least two children, and B) each shares the same prefix with any of its descendant.
The book I've been reading (Algorithms in C, Parts 1-4 by Robert Sedgewick) seems to describe a PATRICIA trie storing N values with only N nodes, using internal nodes to store values:
Like DSTs, patricia tries allow search for N keys in a tree with just
N nodes. ... we avoid external nodes via another simple device: we
store data in internal nodes and replace links to external nodes with
links that point back upwards to the correct internal node in the trie
It seems there are two camps of belief here:
On the one hand we have a strict, specific definition (i.e. Sedgewick, Knuth and Morrison, who all seem to describe PATRICIA exclusively as a prefix-compressed binary tree with one-way branching eliminated); and
Then we have those who believe the term forms a loose, vague definition, which seems more like they meant to use a word like "map", "dictionary" or "trie" (all of which actually are loosely defined), e.g. the libstdc++ documentation.
I guess I'm concerned about the accuracy of my resources. As I understand it, due to the problems introduced by common prefixes, it isn't possible to represent a tree of N items with just N nodes without presenting it as a binary tree (which seems to violate point 2 of the libstdc++ docs, and point 4 when dealing with variable-width keys), and without losing the notion of strict one-way branching (a violation of points 3 and 4, since it renders the concepts of "leaf nodes" and "children" somewhat invalid). The two features work in tandem to eliminate the internal nodes that would otherwise push such trees above N nodes (recall: N items with just N nodes).
These two groups of references can't both be correct; their claims are mutually exclusive. Where one reference says PATRICIA is binary and another says it might not be, they can't both be considered factually correct, and that's just one example of the inconsistency I see here. Which of these references are correct?

I continued to search past reputable sources for a specific definition to confirm what I had suspected, and I'm writing to provide my findings. Perhaps the most significant is the original paper defining PATRICIA, published by D. R. Morrison in the October 1968 issue of the Journal of the ACM:
PATRICIA evolved from "A Library Automaton" [3] and other studies. ...
Early in this evolution it was decided that the alphabet should be
restricted to a binary one. A theorem which strongly influenced this
decision is one which, in another form, is due to Euler. The theorem
states that if the alphabet is binary, then the number of branches is
exactly one less than the number of ends. Corollaries state that as
the library grows, each new end brings into the library with it
exactly one new branch, and each branch has exactly two exits. These
facts are very useful in the allocation of storage for the index. They
imply that the total storage required is completely determined by the
number of ends, and all of the storage required will actually be used.
This certainly contradicts points 2 and 3 of the libstdc++ reference. There's further evidence in this paper, such as specific algorithm details, but the quote above should suffice.
There don't appear to be any deviations from the official description in the Sedgewick quote, however. Based on that, the libstdc++ resource is certainly less valid than the Sedgewick resource.
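For the curious, here is a minimal sketch in Python (names are mine, and insertion is omitted) of the search loop Sedgewick describes. The key detail is that child links may point back up the tree, and the search stops as soon as a link fails to increase the bit index:

class PatriciaNode:
    def __init__(self, key, bit):
        self.key = key       # the stored key, as a bit string like "0101"
        self.bit = bit       # index of the bit this node tests
        self.left = self     # child links may point back UP the tree
        self.right = self

def bit_at(key, i):
    # Bit i of the key; keys are treated as zero-padded past their end.
    return int(key[i]) if i < len(key) else 0

def search(head, key):
    # Descend while the bit indices strictly increase; the first link
    # that does not increase the index is an upward link, and the node
    # it reaches holds the only candidate key to compare against.
    prev_bit, node = -1, head
    while node.bit > prev_bit:
        prev_bit = node.bit
        node = node.right if bit_at(key, node.bit) else node.left
    return node.key == key

The upward links are exactly what let N keys occupy N nodes: every key lives in one node, and the role of a "leaf" is played by a back edge rather than by a separate external node.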

Although both definitions seem to be correct, the first one is more detailed and seems better to me. Also have a look at this answer, where I try to depict the difference between a PATRICIA trie and a regular trie.


What are some good resources for learning backtracking, branch-and-bound and dynamic programming algorithms? [closed]

CLRS doesn't seem to cover backtracking/branch-and-bound. I tried several resources online, and though I get the idea behind these techniques, I am unable to write code for, let's say, the Knapsack problem. So I want something that, maybe, takes a problem and solves it with these three approaches and at least gives pseudo-code.
Or any other resources that you think will be helpful.
In algorithms which use backtracking, branch and bound, etc., there is typically the concept of a solution space and a search space. The goal of the algorithm is to traverse the search space to reach a point in the solution space, often one considered optimal by some metric, or to establish that the solution space is empty (without visiting every element in the search space).
The first step is to define a mechanism for expressing an element of the search space in an efficient format. The representation should let you express which elements form the solution space, evaluate the quality of an element by the chosen metric, determine the neighbouring elements reachable from the current state, and so on.
Typically these algorithms traverse the search space until they find a solution, or exit if no solution exists. The traversal happens by visiting a series of points, often in parallel, to explore the search space. This is the branch aspect: you are making decisions to visit certain parts of the search space.
During the traversal they may determine that a particular path is not worth pursuing, and decide not to explore the part of the search space reachable from it. This is the bounding aspect.
Very often the traversal of the space is done by using partial solutions. For example if you have a search space represented by eight bits, you might assign fixed values to two specific bits out of the eight bits, and then search for a desirable solution for the space represented by the remaining six bits. You might discover that a particular assignment of the two bits to a fixed value leads to a situation where no solution can exist (or the quality of the solution is very poor). You can then bound the search space so that the algorithm does not explore any more elements in that sub-space defined by assigning a particular fixed value to those two bits.
For backtracking-based systems the pseudo-code is trivial. The challenge lies in finding an efficient representation of the search space, representing partial solutions, checking the validity of a particular solution, coming up with rules to determine which path to take up front, developing metrics to measure the quality of a solution, figuring out when to backtrack, or how far to backtrack, and so on...
StateStack.push(StartState)
loop {
    curState = StateStack.top
    nextState = calculateNextState(curState)
    StateStack.push(nextState)
    if (reachedFinalGoal(nextState)) {
        break
    }
    if (needToBackTrack(StateStack)) {
        stateToBackTrackTo = calculateStateToBackTrackTo(StateStack)
        while (StateStack.top != stateToBackTrackTo) {
            StateStack.pop()    // roll back one state at a time
        }
    }
}
These are search techniques, rather than algorithms. To start with, you should clearly understand what the search space is. E.g. in the case of a Knapsack problem it would consist of all the possible subsets of available objects. Sometimes there are constraints that define which solutions are valid and which are not; for example, those sets of objects that exceed the total volume of the knapsack are not valid. You should also have a clearly defined objective (here, the total worth of the selected objects).
Wikipedia actually contains a pretty accurate description of Branch and Bound. It's rather high-level, but any more detailed description would require assumptions about the structure of the search space. For backtracking there's even some pseudo-code, but again very general.
An alternative (and probably better) approach is to find example applications of these techniques and study those. There are at least a couple of algorithms involving DP in CLRS, and you can surely google up more if you need to.
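Since the question names the Knapsack problem specifically, here is a hedged sketch of 0/1 knapsack solved by backtracking with a simple branch-and-bound cut (the bound is the standard fractional relaxation; all names are illustrative):

def knapsack(values, weights, capacity):
    # Sort items by value density so the fractional relaxation
    # yields a usable upper bound for pruning.
    items = sorted(zip(values, weights),
                   key=lambda vw: vw[0] / vw[1], reverse=True)
    best = 0

    def bound(i, cap):
        # Upper bound: greedily fill the remaining capacity,
        # allowing a fraction of the first item that doesn't fit.
        total = 0.0
        for v, w in items[i:]:
            if w <= cap:
                cap -= w
                total += v
            else:
                return total + v * cap / w
        return total

    def explore(i, cap, value):
        nonlocal best
        best = max(best, value)                      # record partial solution
        if i == len(items) or value + bound(i, cap) <= best:
            return                                   # bound: prune this branch
        v, w = items[i]
        if w <= cap:
            explore(i + 1, cap - w, value + v)       # branch: take item i
        explore(i + 1, cap, value)                   # branch: skip item i

    explore(0, capacity, 0)
    return best

print(knapsack([60, 100, 120], [10, 20, 30], 50))    # prints 220

The recursion is the backtracking; the bound() test is the bounding step that discards parts of the search space that cannot beat the best solution found so far.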

What are the differences between B-tree and B*-tree, except the requirement for fullness?

I know about this question, but it's about B-trees and B+-trees. Sorry if there's a similar question for B*-trees, but I couldn't find one.
So, what is the difference between these two trees? The Wikipedia article about B*-trees is very short.
The only difference noted there is that non-root nodes must be "at least 2/3 full instead of 1/2". But I guess there's something more. If that were the only difference, there would be just one kind of tree, the B-tree, with different constants for the fullness of each non-root node, rather than two different trees, right?
Also, one more thing that made me think about further differences:
"A B*-tree should not be confused with a B+ tree, which is one where the
leaf nodes of the tree are chained together in the form of a linked list"
So the B+-tree has something really specific: the linked list. What is the specific characteristic of the B*-tree, or isn't there one?
Also, there aren't any external links/references in the Wikipedia article. Are there any resources at all? Articles, tutorials, anything?
Thanks!
No difference other than the minimum fill factor. From Cormen et al. (Introduction to Algorithms), page 489:
B*-tree is a variant of a B-tree that requires each internal node to be at least 2/3 full, rather than at least half full.
Knuth also defines the B*-tree exactly like that (The Art of Computer Programming, Vol. 3).
"The Ubiquitous B-Tree" has a whole sub-section on B*-trees. There, Comer defines the B*-tree exactly as Knuth and Cormen et al. do, but also clarifies where the confusion comes from: B*-tree search algorithms, and some unnamed B-tree variants designed by Knuth which are now called B+-trees.
Maybe you should look at "The Ubiquitous B-Tree" by Comer (ACM Computing Surveys, 1979).
Comer writes something about the B*-tree there (in the section on the B-tree and its variants), and in that section he also cites some more papers on the topic. That should help you to do further investigations on your own :)! (I'm not your researcher ;) )
However, I don't understand the part you cite which says that the B*-tree does not have a linked list at the leaf node level. I'm pretty sure that those nodes are linked together as well.
Regarding having only one B-tree: actually, you do have that. The others, like the B+-tree, the Prefix B+-tree and so on, are just variants of the standard B-tree. Just look at the paper "The Ubiquitous B-Tree".
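To see what the 2/3 requirement means numerically, here is a tiny illustration (mine, not from any of the cited sources; exact rounding conventions vary by author):

import math

def min_children(order, variant="B"):
    # Minimum fan-out of a non-root node, where `order` is the
    # maximum fan-out: B-trees require half-full nodes, B*-trees
    # require two-thirds-full nodes.
    if variant == "B":
        return math.ceil(order / 2)
    if variant == "B*":
        return math.ceil(2 * order / 3)
    raise ValueError(variant)

for m in (8, 16, 32):
    print(m, min_children(m, "B"), min_children(m, "B*"))
# order 32: a B-tree node may shrink to 16 children,
# while a B*-tree node must keep at least 22

The higher fill factor also motivates the B*-tree's characteristic overflow behaviour: instead of immediately splitting a full node into two half-full nodes, keys are first shifted into a sibling, and two full nodes eventually split into three nodes that are each about 2/3 full.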

Clustering tree structured data

Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a Lisp-like S-expression or a (generalized) algebraic data type in Haskell or OCaml.
We are given a large number of "documents" in the tree structure. Our goal is to cluster documents which are similar. By clustering, we mean a way to divide the documents into j groups, such that the elements in each group resemble one another.
I am sure there are papers out there which describe approaches, but since I am not very well versed in the area of AI/clustering/machine learning, I want to ask somebody who is: what should I look for, and where should I dig?
My current approach is something like this:
I want to convert each document into an N-dimensional vector set up for K-means clustering.
To do this, I recursively walk the document tree and calculate a vector for each level. If I am at a tree vertex, I recur on all subvertices and then sum their vectors. Also, whenever I recur, a power factor is applied so the further down the tree I go, the less it matters. The document's final vector is the one computed at the root of the tree.
Depending on the data at a tree leaf, I apply a function which maps the data to a vector.
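A minimal sketch of this recursive scheme (Python; the dimensionality, decay constant and leaf hashing are illustrative choices of mine):

import numpy as np

DIM = 32      # dimensionality of the document vectors
DECAY = 0.5   # the "power factor": deeper levels matter less

def leaf_vector(label):
    # Hypothetical stand-in for the leaf-data-to-vector function,
    # here a cheap deterministic projection seeded by the label.
    rng = np.random.default_rng(abs(hash(label)) % (2 ** 32))
    return rng.standard_normal(DIM)

def doc_vector(node):
    # Sum the children's vectors, damped by DECAY per level; the
    # vector computed at the root represents the whole document.
    children = node.get("children", [])
    if not children:
        return leaf_vector(node["label"])
    return sum(DECAY * doc_vector(child) for child in children)

doc = {"label": "root", "children": [{"label": "a"}, {"label": "b"}]}
vector = doc_vector(doc)   # feed these vectors to k-means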
But surely there are better approaches. One weakness of my approach is that it will only similarity-cluster trees whose top structures closely resemble one another. If the similarity is present but occurs farther down the tree, my approach probably won't work very well.
I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.
Distance function
As suggested, one needs to define a distance function between documents. Without this function, we can't apply a clustering algorithm.
In fact, it may be that the question is about that very distance function and examples thereof. I want documents whose elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.
The take-one-step-back viewpoint:
I want to cluster stack traces from programs. These are well-formed tree structures, where the functions close to the root are the inner functions in which the failure occurred. I need a decent distance function between stack traces that probably occurred because the same event happened in the code.
Given the nature of your problem (stack trace), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.
If string matching would indeed be more appropriate for your problem, you can run through your data, map each node onto a hash and create for each 'document' its n-grams.
Example:
Mapping:
Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3
Doc A: 0-1-2
Doc B: 1-2-3
2-grams for doc A:
X0, 01, 12, 2X
2-grams for doc B:
X1, 12, 23, 3X
Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, the 2-gram 12).
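A small sketch of that pipeline (Python; the padding symbol X matches the example above, and Jaccard similarity over the n-gram sets is just one plausible choice of distance):

def ngrams(events, n=2, pad="X"):
    # Slide a window of size n over the padded event sequence.
    seq = [pad] + list(events) + [pad]
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

doc_a = ["0", "1", "2"]   # Exception A -> B -> C
doc_b = ["1", "2", "3"]   # Exception B -> C -> D

grams_a, grams_b = set(ngrams(doc_a)), set(ngrams(doc_b))
similarity = len(grams_a & grams_b) / len(grams_a | grams_b)
print(similarity)         # the shared bigram is (1, 2)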
However, if you are still convinced that you need trees, instead of strings, you must consider the following: finding similarities for trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth resulting in a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees for trees that extend it). What you don't want is a data collection containing subtrees that are very rare, or that are present in each document you are processing (which you will get if you do not look for frequent patterns).
Here are some pointers:
http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/
Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
Here you may find a paper that seems closely related to your problem.
From the abstract:
This thesis presents Ixor, a system which collects, stores, and analyzes
stack traces in distributed Java systems. When combined with third-party
clustering software and adaptive cluster filtering, unusual executions can be
identified.

How to implement a graph-structured stack?

OK, so I would like to make a GLR parser generator. I know such programs already exist and are better than what I will probably make, but I am doing this for fun/learning, so that's not important.
I have been reading about GLR parsing and I think I have a decent high level understanding of it now. But now it's time to get down to business.
The graph-structured stack (GSS) is the key data structure for use in GLR parsers. Conceptually I know how GSS works, but none of the sources I looked at so far explain how to implement GSS. I don't even have an authoritative list of operations to support. Can someone point me to some good sample code/tutorial for GSS? Google didn't help so far. I hope this question is not too vague.
Firstly, if you haven't already, you should read McPeak's paper on GLR http://www.cs.berkeley.edu/~smcpeak/papers/elkhound_cc04.ps. It is an academic paper, but it gives good details on GSS, GLR, and the techniques used to implement them. It also explains some of the hairy issues with implementing a GLR parser.
You have three parts to implementing a graph-structured stack.
I. The graph data structure itself
II. The stacks
III. GLR's use of a GSS
You are right, Google isn't much help. And unless you like reading algorithms books, they won't be much help either.
I. The graph data structure
Rob's answer about "the direct representation" would be easiest to implement. It's a lot like a linked-list, except each node has a list of next nodes instead of just one.
This data structure is a directed graph, but as McPeak states, the GSS may have cycles for epsilon-grammars.
II. The stacks
A graph-structured stack is conceptually just a list of regular stacks. For an unambiguous grammar, you only need one stack. You need more stacks when there is a parsing conflict so that you can take both parsing actions at the same time and maintain the different state both actions create. Using a graph allows you to take advantage of the fact that these stacks share elements.
It may help to understand how to implement a single stack with a linked-list first. The head of the linked list is the top of the stack. Pushing an element onto the stack is just creating a new head and pointing it to the old head. Popping an element off the stack is just moving the pointer to head->next.
In a GSS, the principle is the same. Pushing an element is just creating a new head node and pointing it to the old head. If you have two shift operations, you will push two elements onto the old head and then have two head nodes. Conceptually this is just two different stacks that happen to share every element except the top ones. Popping an element is just moving the head pointer down the stack by following each of the next nodes.
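Here is a minimal sketch of those operations over the direct representation (Python; illustrative, not taken from McPeak's paper):

class GssNode:
    # One stack element; `preds` points down the stack(s), so
    # several heads can share everything below a split.
    def __init__(self, state, preds=()):
        self.state = state
        self.preds = list(preds)

def push(head, state):
    # Shift: create a new top above an existing head.
    return GssNode(state, [head])

def pop_all(head, depth):
    # Reduce: collect every node reachable by popping `depth`
    # elements; with merged stacks there can be several.
    frontier = [head]
    for _ in range(depth):
        frontier = [p for node in frontier for p in node.preds]
    return frontier

bottom = GssNode("s0")
a, b = push(bottom, "s1"), push(bottom, "s2")   # conflict: two heads
assert pop_all(a, 1) == [bottom]                # both stacks share s0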
III. GLR's use of the GSS
This is where McPeak's paper is a useful read.
The GLR algorithm takes advantage of the GSS by merging stack heads that have the same state element. This means that one state element may have more than one child. When reducing, the GLR algorithm will have to explore all possible paths from the stack head.
You can optimize GLR by maintaining the deterministic depth of each node. This is just the distance from a split in the stack. This way you don't always have to search for a stack split.
This is a tough task! So good luck!
The question that you're asking isn't trivial. I see two main ways of doing this:
The direct representation. Your data structure is represented in memory as node objects/structures, where each node has a reference/pointer to the structs below it on the stack (one could also make the references bi-directional, as an alternative). This is the way lists and trees are normally represented in memory. It is a bit more complicated in this case, because unlike a tree or a list, where one need only maintain a reference to root node or head node to keep track of the tree, here we would need to maintain a list of references to all the 'top level' nodes.
The adjacency list representation. This is similar to the way that mathematicians like to think about graphs: G = (V, E). You maintain a list of edges, indexed by the vertices which are the origin and termination points for each edge.
The first option has the advantage that traversal can be quicker, as long as the GSS isn't too flat. But the structure is slightly more difficult to work with. You'll have to roll a lot of your own algorithms.
The second option has the advantage of being more straightforward to work with. Most algorithms in textbooks seem to assume some kind of adjacency list representation, which makes it easier to apply the wealth of graph algorithms out there.
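For contrast, the push operation over an adjacency-list representation might look like this (again just a sketch):

from collections import defaultdict

edges = defaultdict(list)   # edges[v] lists the vertices below v
heads = set()               # the current stack tops

def push(head, new_id):
    # Shift: record an edge from the new top down to the old head.
    edges[new_id].append(head)
    heads.discard(head)
    heads.add(new_id)
    return new_id

heads.add("s0")
push("s0", "s1")
print(dict(edges), heads)   # {'s1': ['s0']} {'s1'}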
Some resources:
There are various types of adjacency list, e.g. hash table based, array based, etc. The wikipedia adjacency list page is a good place to start.
Here's a blog post from someone who has been grappling with the same issue. The code is Clojure, which may or may not be familiar, but the discussion is worth a look either way.
I should mention that I wish there were more information available about representing directed acyclic graphs (or graph-structured stacks, if you prefer), given the widespread application of this sort of model. I think there is room for better solutions to be found.

Finding the most frequent subtrees in a collection of (parse) trees

I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.
EDIT @mjv: We are working in a limited domain (chemistry) which has a stylised language, so the variety of the trees is not huge - probably similar to children's readers. A simple tree for "the cat sat on the mat":
<sentence>
  <nounPhrase>
    <article/>
    <noun/>
  </nounPhrase>
  <verbPhrase>
    <verb/>
    <prepositionPhrase>
      <preposition/>
      <nounPhrase>
        <article/>
        <noun/>
      </nounPhrase>
    </prepositionPhrase>
  </verbPhrase>
</sentence>
Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat", "mat" are not important in matching), so the algorithm would need to detect this. Note that not all nounPhrases are identical; "the big black cat" could be:
<nounPhrase>
  <article/>
  <adjective/>
  <adjective/>
  <noun/>
</nounPhrase>
The length of sentences will be longer - between 15 and 30 nodes. I would expect to get useful results from 1000 trees. If this takes no more than a day or so, that's acceptable.
Obviously the shorter the tree, the more frequently it will occur, so nounPhrase will be very common.
EDIT: If this is to be solved by flattening the tree, then I think it would be related to Longest Common Substring, not Longest Common Subsequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).
To find the most frequent subtrees in the collection, create a compact form of each subtree, then iterate over every subtree and use a hash set to count their occurrences. 30 nodes is too big for a perfect hash: that gives only about one bit per node, and you need at least that much just to indicate whether the next node is a sibling or a child.
This problem isn't LCS: the most frequent subtree isn't related to the longest common subsequence; it is simply the subtree that occurs most often.
It should be at worst O(N L^2) for N trees of length L (assuming that testing equality of a subtree containing L nodes is O(L)).
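A sketch of that approach, using a parenthesised string as the compact form (Python; child order is preserved and leaf tokens are ignored, as the question requires):

from collections import Counter

def canon(node):
    # Canonical string for the subtree rooted at `node`; recomputing
    # it at every node is the O(L) equality test assumed above.
    label, children = node
    return label + "(" + ",".join(canon(c) for c in children) + ")"

def count_subtrees(trees):
    counts = Counter()
    def walk(node):
        counts[canon(node)] += 1
        for child in node[1]:
            walk(child)
    for tree in trees:
        walk(tree)
    return counts

# "the cat sat on the mat" from the question, as (label, children):
np_ = ("nounPhrase", [("article", []), ("noun", [])])
sent = ("sentence", [np_, ("verbPhrase", [("verb", []),
        ("prepositionPhrase", [("preposition", []), np_])])])
print(count_subtrees([sent]).most_common(3))
# the nounPhrase subtree is counted twice, as expected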
I think, although you say that performance isn't yet an issue, this is an NP-hard problem, so it may never be possible to make it fast. If I've understood correctly, you can consider this a variant of the Longest common subsequence problem; if you flatten your tree into a straight sequence like
(nounphrase)(DOWN)(article:the)(adjective:big)(adjective:black)(noun:cat)(UP)
Then your problem becomes LCS.
Wikibooks has a Java implementation of LCS here.
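If you do flatten, an encoding with explicit markers might look like this (a sketch reproducing the sequence above):

def flatten(node):
    # Serialise a (label, children) tree into a token string with
    # explicit DOWN/UP markers so the structure survives flattening.
    label, children = node
    tokens = ["(%s)" % label]
    if children:
        tokens.append("(DOWN)")
        for child in children:
            tokens.extend(flatten(child))
        tokens.append("(UP)")
    return tokens

cat = ("nounphrase", [("article:the", []), ("adjective:big", []),
                      ("adjective:black", []), ("noun:cat", [])])
print("".join(flatten(cat)))
# (nounphrase)(DOWN)(article:the)(adjective:big)(adjective:black)(noun:cat)(UP)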
This is a well-known problem in computer science, for which there are efficient solutions.
Here are some relevant references:
Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Optimized Substructure Discovery for Semi-structured Data, Proc. 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2002), LNAI 2431, Springer-Verlag, 1-14, August 2002.
Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.
Or, if you just want fast code, go here:
FREQT
(transforming XML to S-expressions shouldn't give you too many problems, and is left as an exercise for the reader)
I found a tool called gSpan very useful in this case. It's available for free download at http://www.cs.ucsb.edu/~xyan/software/gSpan.htm . A C++ version with a MATLAB interface is at http://www.nowozin.net/sebastian/gboost/
