Where can I find the definitions of binary trees and the algorithms associated with them in Isabelle?

Where can I find the definition of a binary tree and the algorithms associated with binary trees in Isabelle?
I am a beginner in Isabelle and am therefore looking for learning materials. Recently I tried to find the definition of a binary tree and the algorithms on binary trees in Isabelle, but unfortunately my attempt failed. Where can I find them? Thank you in advance for your help.

Binary trees are defined in HOL-Library (~~/src/HOL/Library/Tree.thy). Some algorithms on them (e.g. implementations of data structures such as AVL trees on top of them) are defined in HOL-Data_Structures (~~/src/HOL/Data_Structures/).
Both of these are in the Isabelle distribution. You can import them by writing e.g. "Data_Structures.AVL_Set" or "HOL-Library.Tree" (the quotation marks are required when there's a dash in the name).

Related

Decision Tree clarification

I just want to ask/clarify: are decision trees essentially binary trees where each node is a boolean test, continuing down until a desired result is reached?
Not necessarily. Some nodes may share children, which is not the case in binary trees. However, the essence of a decision tree is what you mentioned:
it's a tree where, based on the probability of an outcome, you move down the graph until you hit an outcome.
See Wikipedia's page on decision trees for more info.
As mentioned by Ares, not all decision trees are binary (they can be "n-ary"), although most implementations I have seen use binary trees.
For instance, if you have a color variable (i.e. categorical) that can take three values (red, blue, or green), you might want to split three ways directly at a node instead of splitting in two and then in two again (or more).
The choice between binary and "n-ary" usually depends on your data. I suspect most people use binary trees anyway because they are easier to implement and more flexible.
Then, as you said, the tree is grown until the desired outcome is reached. Decision trees suffer from major drawbacks such as overfitting, and there are many different ways to tackle this issue (pruning, boosting, etc.), but that is beyond the scope of this question/answer.
I recommend having a look at this great visualization that explains decision trees well: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
I'd be happy to give more details about decision trees.
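To make the binary vs. "n-ary" distinction concrete, here is a minimal Python sketch of classifying a sample down a tree that mixes a three-way categorical split with a binary one. The dict-based node layout and field names are purely illustrative, not from any particular library:

```python
# Minimal sketch of decision-tree traversal with an n-ary categorical split.
# The node layout (dicts with "feature"/"children"/"label") is hypothetical.

def classify(node, sample):
    """Walk the tree until a leaf (a node carrying a 'label') is reached."""
    while "label" not in node:
        value = sample[node["feature"]]
        node = node["children"][value]  # n-ary: one child per category value
    return node["label"]

# A tiny tree: split three ways on "color", then binary on "size".
tree = {
    "feature": "color",
    "children": {
        "red":   {"label": "stop"},
        "green": {"label": "go"},
        "blue": {
            "feature": "size",
            "children": {"small": {"label": "go"}, "large": {"label": "stop"}},
        },
    },
}

print(classify(tree, {"color": "red"}))                    # stop
print(classify(tree, {"color": "blue", "size": "small"}))  # go
```

A binary tree would instead encode the color split as two nested yes/no tests (e.g. "color == red?" then "color == blue?"), which is what the answer means by "splitting in two and then in two again".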

What are the differences between suffix links and failure links?

I am studying algorithms in this semester and have read about the Aho-Corasick string matching algorithm and Ukkonen's algorithm for building suffix trees.
I have read about both of them but can't understand the main differences between the two, except that failure links check prefixes and suffix links check suffixes.
What is the difference between these two algorithms?
I think that your understanding of suffix links and failure links is incorrect. In both cases, a suffix/failure link is a pointer from one node in the trie/suffix tree to another node with the following property: if the original node represents the string x, then the link points to the node representing the longest string y that is a proper suffix of x and is itself present in the trie/suffix tree.
The main difference between the two algorithms is what the algorithms produce rather than what the suffix/failure links mean. Aho-Corasick produces a trie annotated with extra transition information that makes it possible to find all instances of a collection of strings as rapidly as possible. The failure links produced are used both in the construction of the algorithm and in the pattern-matching step. Ukkonen's algorithm produces a suffix tree, using the suffix links only during construction and not during most queries on the tree.
Hope this helps!
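To make the failure-link construction concrete, here is a minimal Python sketch of building an Aho-Corasick trie and computing its failure links with a breadth-first search. The dict-based node layout and function names are my own illustration, not from any particular library:

```python
# Sketch of Aho-Corasick failure links computed via BFS over a trie.
from collections import deque

def build_trie(words):
    root = {"next": {}, "fail": None, "out": []}
    for w in words:
        node = root
        for ch in w:
            node = node["next"].setdefault(ch, {"next": {}, "fail": None, "out": []})
        node["out"].append(w)
    return root

def add_failure_links(root):
    # BFS order guarantees that a node's failure link (which points to a
    # strictly shorter string) is already final when the node is processed.
    queue = deque()
    for child in root["next"].values():
        child["fail"] = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node["next"].items():
            f = node["fail"]
            while f is not None and ch not in f["next"]:
                f = f["fail"]
            child["fail"] = f["next"][ch] if f is not None else root
            # Flattened "dictionary links": inherit matches ending here.
            child["out"] = child["out"] + child["fail"]["out"]
            queue.append(child)

def search(root, text):
    """Return (start_index, word) for every pattern occurrence in text."""
    hits, node = [], root
    for i, ch in enumerate(text):
        while node is not None and ch not in node["next"]:
            node = node["fail"]
        node = node["next"][ch] if node is not None else root
        for w in node["out"]:
            hits.append((i - len(w) + 1, w))
    return hits

root = build_trie(["he", "she", "his", "hers"])
add_failure_links(root)
print(search(root, "ushers"))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Note how each failure link points to the node for the longest proper suffix of the current string that is also a path in the trie, exactly as described above.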
Put another way: in Aho-Corasick the failure links (and the dictionary links derived from them) are computed with a breadth-first search over the trie, whereas Ukkonen's algorithm sets suffix links during its online construction. Both kinds of links point to nodes representing suffixes.

Tarjan's offline Least Common Ancestor Algorithm

I am currently reading about an algorithm from Tarjan on how to get the Least Common Ancestor of two nodes in a Binary Tree.
I have read the pseudocode on Wikipedia, but I'm not getting the gist of it; I am not able to apply the algorithm to a given binary tree. I also tried to find an explanation of each step on Google but did not find anything worthwhile. If anybody can help me understand how this algorithm works on a binary tree, it would be much appreciated.
Besides the given binary tree, you need to implement another data structure, called a disjoint set (union-find), to apply this algorithm. It comes with three main operations: MAKE-SET, UNION, and FIND-SET. I strongly recommend reading Chapter 21, "Data Structures for Disjoint Sets", of Introduction to Algorithms for a better understanding.
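As a rough illustration of how the three disjoint-set operations fit into Tarjan's offline algorithm, here is a minimal Python sketch. The dict-based tree representation, the query scan, and the recursion are simplifications for readability (a real implementation would index queries per node and avoid deep recursion):

```python
# Sketch of Tarjan's offline LCA using a disjoint-set (union-find) structure.

def tarjan_olca(tree, root, queries):
    """tree: {node: [children]}; queries: list of (u, v) pairs.
    Returns {(u, v): lca}."""
    parent, rank, ancestor = {}, {}, {}
    visited, answer = set(), {}

    def make_set(x):                    # MAKE-SET
        parent[x], rank[x] = x, 0

    def find(x):                        # FIND-SET with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):                    # UNION by rank
        rx, ry = find(x), find(y)
        if rank[rx] < rank[ry]:
            rx, ry = ry, rx
        parent[ry] = rx
        if rank[rx] == rank[ry]:
            rank[rx] += 1

    def dfs(u):
        make_set(u)
        ancestor[find(u)] = u
        for child in tree.get(u, []):
            dfs(child)
            union(u, child)
            ancestor[find(u)] = u       # u stays the set's representative ancestor
        visited.add(u)
        for (a, b) in queries:          # answer queries whose other end is done
            if a == u and b in visited:
                answer[(a, b)] = ancestor[find(b)]
            elif b == u and a in visited:
                answer[(a, b)] = ancestor[find(a)]

    dfs(root)
    return answer

tree = {1: [2, 3], 2: [4, 5], 3: [6]}
print(tarjan_olca(tree, 1, [(4, 5), (4, 6), (5, 3)]))
# {(4, 5): 2, (4, 6): 1, (5, 3): 1}
```

The key invariant is that when node u finishes, every already-visited node's set representative has `ancestor` pointing to the lowest ancestor that is still on the DFS stack, which is exactly the LCA for queries closed at u.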

Really hard to understand suffix trees

I've been searching for tutorials about suffix tree for quite a while. In SO, I found 2 posts about understanding suffix tree: 1, 2.
But I can't say that I understand how to build one. In Skiena's book The Algorithm Design Manual, he says:
Since linear time suffix tree construction algorithms are nontrivial,
I recommend using an existing implementation.
Well, is the online construction algorithm for suffix trees really so hard? Can anybody point me in the right direction to understand it?
Anyway, cut to the chase: besides the construction, there is one more thing I don't understand about suffix trees. Because the edges in a suffix tree are just pairs of integers (right?) specifying the starting and ending positions of a substring, if I want to search for a string x in this suffix tree, how should I do it? Dereference those integers into the text and compare the characters one by one with x? Surely it can't work that way.
Firstly, there are many ways to construct a suffix tree. There is the original O(n) method by Weiner (1973), the improved one by McCreight (1976), the most well-known by Ukkonen (1991/1992), and a number of further improvements, largely related to implementation and storage efficiency considerations. Most notable among those is perhaps the Efficient implementation of lazy suffix trees by Giegerich and Kurtz.
Moreover, since the direct construction of suffix arrays has been possible in O(n) time since 2003 (e.g. using the Skew algorithm, though there are others as well), and since there are well-studied methods for
- emulating suffix trees using suffix arrays (e.g. Abouelhoda/Kurtz 2004)
- compressing suffix arrays (see Navarro/Mäkinen 2007 for a survey)
suffix arrays are usually preferred over suffix trees. Therefore, if your intention is to build a highly optimised implementation for a specific purpose, you might want to look into studying suffix array construction algorithms.
However, if your interest is in suffix tree construction, and in particular the Ukkonen algorithm, I would like to suggest that you take a close look at the description in this SO post, which you mentioned already, and we try to improve that description together. It's certainly far from a perfectly intuitive explanation.
To answer the question about how to compare input string to edge labels: For efficiency reasons during construction and look-up, the initial character of every edge label is usually stored in the node. But the rest must be looked up in the main text string, just like you said, and indeed this can cause issues, in particular when the string is so large that it cannot readily be held in memory. That (plus the fact that, like any direct implementation of a tree, the suffix tree is a data structure that contains a lot of pointers, which consume much memory and make it hard to maintain locality of reference and to benefit from memory caching) is one of the main reasons why suffix trees are so much harder to handle than e.g. inverted indexes.
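To illustrate the (start, end) edge representation the question asks about, here is a minimal Python sketch of a naive O(n^2) suffix tree build plus a search that dereferences the stored indices back into the text. This is deliberately not Ukkonen's algorithm, and the dict-based node layout is my own illustration:

```python
# Sketch: suffix tree with edges stored as (start, end) index pairs into the
# text, built naively by inserting every suffix into a compressed trie.

def build_suffix_tree(text):
    text += "$"                      # unique terminator
    root = {}                        # node: {first_char: (start, end, child)}
    n = len(text)
    for i in range(n):               # insert suffix text[i:]
        node, j = root, i
        while j < n:
            c = text[j]
            if c not in node:
                node[c] = (j, n, {})          # new leaf edge
                break
            start, end, child = node[c]
            k = start
            while k < end and text[k] == text[j]:
                k += 1
                j += 1
            if k == end:
                node = child                  # consumed the whole edge
            else:
                mid = {text[k]: (k, end, child)}  # split the edge at k
                node[c] = (start, k, mid)
                node = mid
    return root, text

def find(tree, text, pattern):
    """Return True iff pattern occurs in text, dereferencing edge indices."""
    node, j = tree, 0
    while j < len(pattern):
        c = pattern[j]
        if c not in node:
            return False
        start, end, child = node[c]
        k = start
        while k < end and j < len(pattern):
            if text[k] != pattern[j]:         # compare via the text indices
                return False
            k += 1
            j += 1
        node = child
    return True

tree, text = build_suffix_tree("banana")
print(find(tree, text, "ana"), find(tree, text, "nab"))  # True False
```

So yes, searching really does dereference the integer pairs into the main string, which is exactly why keeping the text in memory (or at least cached) matters so much in practice.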
If you combine the suffix array with an LCP table and a child table, which of course you should do, you essentially get a suffix tree. This point is made in the paper Linearized Suffix Trees by Kim, Park and Kim. The LCP table enables a rather awkward bottom-up traversal, and the child table enables an easy traversal of either kind. So the story about suffix trees using pointers and causing locality-of-reference problems is, in my opinion, obsolete information. The suffix tree is therefore "the right and easy way to go," as long as you implement the tree using an underlying suffix array.
The paper by Kim, Park and Kim describes a variant of the approach in Abouelhoda et al.'s misleadingly titled paper Replacing suffix trees with enhanced suffix arrays. The Kim et al. paper gets it right that this is an implementation of suffix trees, not a replacement. Moreover, the details of Abouelhoda et al.'s construction are described more simply and intuitively in Kim et al.
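As a small illustration of two of the ingredients of such an enhanced suffix array, here is a sketch of a (deliberately naive) suffix array construction plus Kasai et al.'s linear-time LCP computation; the child table is omitted for brevity:

```python
# Sketch: suffix array + LCP table, the basic building blocks of an
# "enhanced suffix array". The naive O(n^2 log n) sort is for illustration;
# use a linear-time algorithm (e.g. the Skew/DC3 algorithm) in practice.

def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """Kasai et al.: lcp[r] = length of the longest common prefix of the
    suffixes at ranks r and r-1 in the suffix array."""
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):                 # process suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]        # the suffix ranked just before
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1                 # lcp can shrink by at most 1 per step
    return lcp

text = "banana"
sa = suffix_array(text)
print(sa, lcp_array(text, sa))  # [5, 3, 1, 0, 4, 2] [0, 1, 3, 0, 0, 2]
```

Intervals of the suffix array with LCP values above a threshold correspond exactly to internal nodes of the suffix tree, which is the observation the enhanced-suffix-array papers build on.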
There's an implementation of Ukkonen's linear construction of suffix trees (plus suffix arrays and the LCP array) here: http://code.google.com/p/text-indexing/. The visualization provided along with suffixtree.js may help.

Conceptually simple linear-time suffix tree constructions

In 1973 Weiner gave the first linear-time construction of suffix trees. The algorithm was simplified in 1976 by McCreight, and in 1995 by Ukkonen. Nevertheless, I find Ukkonen's algorithm relatively involved conceptually.
Have there been simplifications of Ukkonen's algorithm since 1995?
A more direct answer to the original question is the top-down (and lazy) suffix tree construction by Giegerich, Kurtz, Stoye: https://pub.uni-bielefeld.de/luur/download?func=downloadFile&recordOId=1610397&fileOId=2311132
In addition, suffix arrays (as mentioned in the previous answer) are not only easier to construct, but they can be enhanced so as to emulate anything you'd expect from a suffix tree: http://www.daimi.au.dk/~cstorm/courses/StrAlg_e04/papers/KurtzOthers2004_EnhancedSuffixArrays.pdf
Since the data structures involved in an enhanced suffix array can be compressed, compressed (emulated) suffix trees become possible: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.8644&rep=rep1&type=pdf
This isn't a direct answer, but it may help you.
Last year, while working on the subject, I ended up using suffix arrays instead of suffix trees, and IIRC I used the paper "An incomplex algorithm for fast suffix array construction" by K. B. Schürmann (2007) [1] as a reference. IIRC, it's a two-pass linear algorithm for building suffix arrays.
[1] http://scholar.google.com/scholar?q=An+incomplex+algorithm+for+fast+suffix+array+construction+&hl=en&btnG=Search&as_sdt=1%2C5&as_sdtp=on

Resources