Decision Tree clarification - binary-tree

I just want to ask/clarify if decision trees are essentially binary trees where each node is a boolean, and it continues down until a desired result is reached?

Not necessarily. Some nodes may share children which is not the case in Binary trees. However, the essence of the decision tree is what you mentioned.
It's a tree where based on the probability of an outcome you move down the graph until you hit an outcome.
See Wikipedia's page on desicion trees for more info.

As mentioned by Ares, not all decision trees are binary (they can be "n-ary") although most implementation I have seen are binary trees.
For instance if you have a color variable (i.e. categorical) that can take three values : red, blue or green; you might want to split in three directly at a node instead of splitting in two and then again in two (or more).
The choice between binary and "n-ary" will usually depends on your data. I suspect that most people use binary trees anyway because it is relatively easier to implement and more flexible.
Then as you said the tree is developed until the desired outcome is reached. Decision Tree suffers major drawbacks such as overfitting and there exist many different ways to tackle this issue (pruning, boosting, etc.) but this is beyond the scope of the question/answer.
I recommend to have a look at this great visualization that explains well the decision tree : http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Will be happy to give more details about decision tree

Related

Merkle tree for finding data inconsistencies - optimizing number of queries

I understand the idea behind using Merkle tree to identify inconsistencies in data, as suggested by articles like
Key Concepts: Using Merkle trees to detect inconsistencies in data
Merkle Tree | Brilliant Math & Science Wiki
Essentially, we use a recursive algorithm to traverse down from root we want to verify, and follow the nodes where stored hash values are different from server (with trusted hash values), all the way to the inconsistent leaf/datablock.
If there's only one such block (leaf) that's corrupted, this means we following a single path down to leaf, which is log(n) queries.
However, in the case of multiple inconsistent data blocks/leaves, we need up to O(n) queries. In the extreme case, all data blocks are corrupted, and our algorithm will need to send every single node to server (authenticator). In the real world this becomes costly due to the network.
So my question is, is there any known improvement to the basic traverse-from-root algorithm? A possible improvement I could think of is to query the level of nodes in the middle. For example, in the tree below, we send the server the two nodes in the second level ('64' and '192'), and for any node that returns inconsistency, we recursively go to the middle level of that sub-tree - something like a binary search based on height.
This increases our best case time from O(1) to O(sqrt(n)), and probably reduces our worst case time to some extent (I have not calculated how much).
I wonder if there's any better approach than this? I've tried to search for relevant articles on Google Scholar, but looks like most of the algorithm-focused papers are concerned with the merkle-tree traversal problem, which is different from the problem above.
Thanks in advance!

Similarities Between Trees

I am working on a problem of Clustering of Results of Keyword Search on Graph. The results are in the form of Tree and I need to cluster those threes in group based on their similarities. Every node of the tree has two keys, one is the table name in the SQL database(semantic form) and second is the actual values of a record of that table(label).
I have used Zhang and Shasha, Klein, Demaine and RTED algorithms to find the Tree Edit Distance between the trees based on these two keys. All algorithms use no of deletion/insertion/relabel operation need to modify the trees to make them look same.
**I want some more matrices of to check the similarities between two trees e.g. Number of Nodes, average fan outs and more so that I can take a weighted average of these matrices to reach on a very good similarity matrix which takes into account both the semantic form of the tree (structure) and information contained in the tree(Labels at the node).
Can you please suggest me some way out or some literature which can be of some help?**
Can anyone suggest me some good paper
Even if you had the (pseudo-)distances between each pair of possible trees, this is actually not what you're after. You actually want to do unsupervised learning (clustering) in which you combine structure learning with parameter learning. The types of data structures you want to perform inference on are trees. To postulate "some metric space" for your clustering method, you introduce something that is not really necessary. To find the proper distance measure is a very difficult problem. I'll point in different directions in the following paragraphs and hope they can help you on your way.
The following is not the only way to represent this problem... You can see your problem as Bayesian inference over all possible trees with all possible values at the tree nodes. You probably would have some prior knowledge on what kind of trees are more likely than others and/or what kind of values are more likely than others. The Bayesian approach would allow you to define priors for both.
One article you might like to read is "Learning with Mixtures of Trees" by Meila and Jordan, 2000 (pdf). It explains that it is possible to use a decomposable prior: the tree structure has a different prior from the values/parameters (this of course means that there is some assumption of independence at play here).
I know you were hinting at heuristics such as the average fan-out etc., but you might find it worthwhile to check out these new applications of Bayesian inference. Note, for example that within nonparametric Bayesian method it is also feasible to reason about infinite trees, as done e.g. by Hutter, 2004 (pdf)!

Data structure: a graph that's similar to a tree - but not a tree

I have implemented a data structure in C, based upon a series of linked lists, that appears to be similar to a tree - but not enough to be referred as such, because in theory it allows the existence of cycles. Here's a basic outline of the nodes:
There is a single, identifiable root that doesn't have a parent node or brothers;
Each node contains a pointer to its "father", its nearest "brother" and the first of his "children";
There are "outer" nodes without children and brothers.
How can I name such a data structure? It cannot be a tree, because even if the pointers are clearly labelled and used differently, cycles like father->child->brother->father may very well exist. My question is: terms such as "father", "children" and "brother" can be used in the context of a graph or they are only reserved for trees? After quite a bit of research I'm still unable to clarify this matter.
Thanks in advance!
I'd say you can still call it a tree, because the foundation is a tree data structure. There is precedence for my claim: "Patricia Tries" are referred to as trees even though their leaf nodes may point up the tree (creating cycles). I'm sure there are other examples as well.
It sounds like the additional links you have are essentially just for convenience, and could be determined implicitly (rather than stored explicitly). By storing them explicitly you impose additional invariants on your tree operations (insert, delete, etc), but do not affect the underlying organization, which is a tree.
Precisely because you are naming and treating those additional links separately, they can be thought of as an "overlay" on top of your tree.
In the end it doesn't really matter what you call it or what category it falls into if it works for you. Reading a book like Sedgewick's "Algorithms in C" you realize that there are tons of data structures out there, and nothing wrong with inventing your own!
One more point: Trees are a special case of graphs, so there is nothing wrong with referring to it as a graph (or using graph algorithms on it) as well.

How do I balance a BK-Tree and is it necessary?

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possibly or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are distinct on distance, so I can't simply rotate a given node in the tree without re-calibrating the entire tree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly and might be probabilistically guaranteed to be sub-optimal).
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
There is a lisp example in the article: http://cliki.net/bk-tree. About unbalancing the tree I think the data structure and the method seems to be complicated enough and also the author didn't say anything about unbalanced tree. When you experience unbalanced tree maybe it's not for you?

Balancing a ternary search tree

How does one go about 'balancing' a ternary search tree? Most tst implementations don't address balancing, but suggest inserting in an optimal order (which I can't control.)
The article in Dr. Dobbs about Ternary Search Trees says: D.D. Sleator and R.E. Tarjan describe theoretical balancing algorithms for ternary search trees in "Self-Adjusting Binary Search Trees" (Journal of the ACM, July 1985). You can find online versions of this paper with your favorite search engine.
One simple optimization is to make it a red-black tree, which can avoid some worst-case scenarios. TSTs are really just binary trees where the value of a given node is another TST. So, the "middle" child of a node is not really part of the tree that is being balanced at each level, as it cannot move to a different parent anyway.
This ensures that each tier of the trie is traversed in log(R) time, although you could probably do even better by taking into account the size of the subtries at each node. That looks to be a lot more complicated though!
A generalization of the binary search tree is the B-Tree, which works for fanouts anywhere from 2 and up. That's not the only way to do it, but it's a common one.
Roughly the way it works is if an insert or delete would put the tree out of balance, it steals an element or a space from a neighboring node. If even that isn't enough to keep the tree in balance, its height by will be changed to make room.
read this article:
"Self-Adjusting of Ternary Search Tries Using Conditional Rotations and Randomized Heuristics"
by
"Ghada Hany Badr∗ and B. John Oommen †"
it will help you to understanding self-adjusting and balancing TSTs.

Resources