create decision tree from data - algorithm

I'm trying to create a decision tree from data. I'm using the tree for a guess-the-animal kind of application: the user answers yes/no questions and the program guesses the animal. This program is for homework.
I don't know how to create a decision tree from data. I have no way of knowing in advance which property should be the root node, the data will be different every time, and I can't do it by hand. My data looks like this:
Animal1: property1, property3, property5
Animal2: property2, property3, property5, property6
Animal3: property1, property6
etc.
I searched Stack Overflow and found the ID3 and C4.5 algorithms, but I don't know if I should use them.
Can someone direct me to the algorithm I should use to build a decision tree in this situation?

I searched Stack Overflow and found the ID3 and C4.5 algorithms, but I don't know if I should use them.
Yes, you should. They are very commonly used decision tree algorithms, and there are some nice open-source implementations of them (Weka's J48, for example, is an implementation of C4.5).
If you need to implement something from scratch, a simple decision tree is fairly easy to build, and is constructed recursively:
1. Let the set of labeled samples be S, with the set of properties P = {p1, p2, ..., pk}.
2. Choose a property pi.
3. Split S into two sets, S1 and S2: S1 holds the samples that have property pi, and S2 those that do not. Create two children for the current node, and move S1 and S2 to them respectively.
4. Repeat with S' = S1 and S' = S2 for each of the subsets, as long as they are not empty.
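As a concrete illustration, here is a minimal sketch of these steps in Python (the property choice in step 2 is deliberately naive here, just "first remaining property"; an entropy-based choice is sketched further below):

    # Minimal sketch: build a decision tree over (animal, properties) pairs.
    class Node:
        def __init__(self):
            self.property = None  # property tested at this node (inner nodes)
            self.yes = None       # child for samples that have the property
            self.no = None        # child for samples that lack it
            self.animals = None   # leaf payload: remaining candidate animals

    def build(samples, properties):
        node = Node()
        # Stop when no properties remain or only one candidate is left.
        if not properties or len(samples) <= 1:
            node.animals = [animal for animal, _ in samples]
            return node
        p, rest = properties[0], properties[1:]   # naive choice (step 2)
        s1 = [(a, props) for a, props in samples if p in props]
        s2 = [(a, props) for a, props in samples if p not in props]
        if not s1 or not s2:                # p doesn't split anything here,
            return build(samples, rest)     # so skip it and try the next one
        node.property = p
        node.yes = build(s1, rest)          # step 3: two children
        node.no = build(s2, rest)           # step 4: recurse on S1 and S2
        return node

Using the data from the question:

    samples = [("Animal1", {"property1", "property3", "property5"}),
               ("Animal2", {"property2", "property3", "property5", "property6"}),
               ("Animal3", {"property1", "property6"})]
    tree = build(samples, ["property1", "property2", "property3",
                           "property5", "property6"])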
Some pointers:
At each step you basically split the current data into two subsets: the samples that have pi and those that do not. You then create two new nodes, which are the current node's children, and repeat the process for each of them with the relevant subset of the data.
A smart algorithm chooses the property pi (in step 2) so as to minimize the tree's height as much as it can. Finding the best solution is NP-hard, but there are greedy approaches, for example choosing the split that minimizes entropy (a sketch of this follows after these pointers).
After the tree is created, some pruning is usually applied to it in order to avoid overfitting.
A simple extension of this algorithm is to use multiple decision trees that work separately. This is called Random Forests, and it empirically tends to give pretty good results.
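Returning to the greedy, entropy-minimizing choice of pi mentioned above: here is a hedged sketch, assuming (as in the game) that each sample's class is the animal itself. You could plug best_property into the build sketch in place of the naive choice:

    import math

    def entropy(samples):
        # Shannon entropy of the class (animal) distribution in `samples`.
        n = len(samples)
        counts = {}
        for animal, _ in samples:
            counts[animal] = counts.get(animal, 0) + 1
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def best_property(samples, properties):
        # Greedy step: pick the property whose yes/no split minimizes the
        # size-weighted entropy of the two resulting subsets.
        def split_entropy(p):
            s1 = [s for s in samples if p in s[1]]
            s2 = [s for s in samples if p not in s[1]]
            n = len(samples)
            return (len(s1) / n) * entropy(s1) + (len(s2) / n) * entropy(s2)
        return min(properties, key=split_entropy)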

Related

Data structure to represent a graph

Having a couple of cities and their locations, I want to create a data structure that would represent a graph like this. The graph represents all possible paths that can be taken in order to visit every city only once:
My question is: since this is probably a very common problem, is there an algorithm or ready-made data structure to represent this? The programming language is not important (although I would prefer Java).
Your problem seems very close to the traveling salesman problem, a classic among the classics.
As you intuited, the graph that represents all the possible solutions is indeed a tree (the path from the root to any of its leaves represents a solution).
From there, you can ask yourself several questions:
Is the first city I'll visit an important piece of information, or does only the order matter? For instance, is London-Warsaw-Berlin-Łódź equivalent to Warsaw-Berlin-Łódź-London?
Usually we consider these solutions equivalent when solving a TSP, but that might not be the case for you.
Did you see the link between a potential solution to the TSP and a permutation? Actually, what you're looking for is a way (and the data structure that goes with it) to generate all the permutations of a given set (your set of cities).
With these two points in mind, we can think about a way to generate such a tree. A good strategy to work with trees is to think recursively.
Suppose we have a partial solution consisting of the first k cities. The next city can then be any of the n-k remaining cities. That gives the following pseudo-code:
    // Recursively add one child per unvisited city, enumerating every
    // permutation of the remaining cities below `node`.
    void get_all_permutations(TreeNode node, Set<Integer> not_visited) {
        for (int city : not_visited) {
            Set<Integer> new_set = new HashSet<>(not_visited); // copy, then shrink
            new_set.remove(city);
            TreeNode new_node = new TreeNode();
            new_node.city = city;
            node.add_child(new_node);
            get_all_permutations(new_node, new_set);
        }
    }
This will build the tree recursively.
Depending on your answer to the first point I mentioned (about the importance of the first city), you might want to assign a city to the root node, or not.
Some good areas to look into, if you want to go further with this kind of problem, are enumeration algorithms and recursive algorithms. They're generally a good option when your goal is to enumerate all the elements of a set, but they're also generally an inefficient way to solve problems: in the case of the TSP, for example, this algorithm yields a very inefficient approach, and there are much better ones.
This tree is bad: it contains redundant data. For instance, the connection between nodes 2 and 4 occurs three times in the tree. You want a "structure" that automatically gives the solution to your problem, so that it's easier for you, but that's not how problem solving works. Input data is one set of data and output data is another; they may appear similar, but they can also be quite different.
A simple matrix with one triangle empty and the other containing the data has all the information you need: the matrix coordinates are nodes, and the cells are distances. This is your input data.
What you do with this matrix in your code is a different matter. Maybe you want to write all possible paths. Then write them. Use input data and your code to produce output data.
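For instance, a minimal sketch of such an input matrix in Python; the city names and distances here are made up purely for illustration:

    cities = ["London", "Warsaw", "Berlin", "Lodz"]
    EMPTY = None  # the lower triangle is left empty
    dist = [
        # London  Warsaw  Berlin  Lodz
        [0,       1450,    930,   1340],  # London
        [EMPTY,      0,    520,    130],  # Warsaw
        [EMPTY,  EMPTY,      0,    470],  # Berlin
        [EMPTY,  EMPTY,  EMPTY,      0],  # Lodz
    ]

    def distance(i, j):
        # Read from the filled (upper) triangle regardless of argument order.
        return dist[min(i, j)][max(i, j)]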
What you are looking for is actually a generator of all permutations. If you keep one city fixed as the first one (London, in your diagram), then you need to generate all permutations of the list of all your remaining nodes (Warsaw, Łódź, Berlin).
Such an algorithm is often written recursively: loop over all elements, take each one out, and recurse on the remaining elements. Libraries are often used to achieve this, e.g. itertools.permutations in Python.
Each permutation generated this way should then be put into the resulting graph you originally wanted. For this you can use any graph representation you like, e.g. a nested dictionary structure:
    { a: { b: { c: d,
                d: c },
           c: { b: d,
                d: b },
           d: { b: c,
                c: b } } }
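As a minimal sketch of this idea in Python, keeping London fixed as the first city as in the diagram:

    from itertools import permutations

    remaining = ["Warsaw", "Lodz", "Berlin"]  # London is fixed as the root

    # Each permutation of the remaining cities is one root-to-leaf path.
    for perm in permutations(remaining):
        print(" -> ".join(("London",) + perm))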

Machine learning method which is able to integrate prior knowledge in a decision tree

Does anyone know a machine learning method, or combination of methods, that makes it possible to integrate prior knowledge into the building process of a decision tree?
With "prior knowledge" I mean the information if a feature in a particular node is really responsible for the resulting classification or not. Imagine we only have a short period of time where our features are measured and in this period of time we have a correlation between features. If we now would measure the same features again, we probably would not get a correlation between those features, because it was just a coincidence that they are correlated. Unfortunately it is not possible to measure again.
The problem which arises with that is: the feature which is chosen by the algorithms to perform a split is not the feature which actually leads to the split in the real world. In other words the strongly correlated feature is chosen by the algorithm while the other feature is the one which should be chosen. That's why I want to set rules / causalities / constraints for the tree learning process.
"a particular feature in an already learned tree" - the typical decision tree has one feature per node, and therefore each feature can appear in many different nodes. Similarly, each leaf has one classification, but each classification may appear in multiple leafs. (And with a binary classifier, any non-trivial tree must have repeated classifications).
This means that you can enumerate all leafs and sort them by classification to get uniform subsets of leaves. For each such subset, you can analyze all paths from the root of the tree to see which features occurred. But this will be a large set.
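A hedged sketch of that enumeration in Python (the node attributes .feature, .children, and .label are hypothetical, standing in for however your tree is actually represented):

    def paths_by_class(node, path=(), result=None):
        # Collect all root-to-leaf feature paths, grouped by the leaf's
        # classification, so each uniform subset can be inspected.
        if result is None:
            result = {}
        if not node.children:  # leaf
            result.setdefault(node.label, []).append(path)
        else:
            for child in node.children:
                paths_by_class(child, path + (node.feature,), result)
        return result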
"But in my case there are some features which are strongly correlated ... The feature which is choosen by the algorithms to perform a split is not the feature which actually leads to the split in the real world."
It's been said that every model is wrong, but some models are useful. If the features are indeed strongly correlated, choosing this "wrong" feature doesn't really affect the model.
You can of course just modify the split algorithm in tree building. Trivially, "if the remaining classes are A and B, use split S, else determine the split using algorithm C4.5" is a valid splitting algorithm that hardcodes pre-existing knowledge about two specific classes without being restricted to just that case.
But note that it might just be easier to introduce a combined class A+B in the decision tree, and then decide between A and B in postprocessing.
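A minimal sketch of the first idea, with hypothetical names: c45_choose_split stands in for your ordinary split criterion, and KNOWN_SPLIT for the split you know to be causally right:

    KNOWN_SPLIT = ("feature_x", 0.5)  # hypothetical hardcoded split

    def c45_choose_split(samples, features):
        ...  # placeholder for the standard (e.g. C4.5) criterion

    def constrained_choose_split(samples, features):
        # samples: list of (feature_dict, class_label) pairs.
        classes = {label for _, label in samples}
        if classes == {"A", "B"}:
            return KNOWN_SPLIT  # prior knowledge wins for this specific case
        return c45_choose_split(samples, features)  # default behavior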

Similarities Between Trees

I am working on a problem of clustering the results of keyword search on a graph. The results are in the form of trees, and I need to cluster those trees into groups based on their similarities. Every node of a tree has two keys: one is the table name in the SQL database (the semantic form), and the second is the actual values of a record of that table (the label).
I have used the Zhang and Shasha, Klein, Demaine, and RTED algorithms to find the tree edit distance between the trees based on these two keys. All of these algorithms count the number of deletion/insertion/relabel operations needed to make the trees identical.
I want some more metrics to check the similarities between two trees, e.g. the number of nodes, the average fan-out, and more, so that I can take a weighted average of these metrics and arrive at a good similarity measure that takes into account both the structure of the tree (the semantic form) and the information contained in it (the labels at the nodes).
Can you please suggest an approach, or some literature (a good paper, for instance) that could help?
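For concreteness, the kind of weighted combination being asked about might look like the following sketch (the node_count/avg_fanout API is illustrative, the weights are arbitrary, and normalizing the edit distance by total size is just one possible choice):

    def ratio(x, y):
        # Similarity of two non-negative scalars as min/max, in [0, 1].
        return 1.0 if x == y else min(x, y) / max(x, y)

    def combined_similarity(t1, t2, ted, weights=(0.6, 0.2, 0.2)):
        # `ted` is a precomputed tree edit distance (Zhang-Shasha, RTED, ...).
        w_ted, w_size, w_fan = weights
        sim_ted = 1.0 - ted / (t1.node_count() + t2.node_count())
        sim_size = ratio(t1.node_count(), t2.node_count())
        sim_fan = ratio(t1.avg_fanout(), t2.avg_fanout())
        return w_ted * sim_ted + w_size * sim_size + w_fan * sim_fan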
Even if you had the (pseudo-)distances between each pair of possible trees, this is actually not what you're after. You actually want to do unsupervised learning (clustering) in which you combine structure learning with parameter learning. The types of data structures you want to perform inference on are trees. By postulating "some metric space" for your clustering method, you introduce something that is not really necessary. Finding the proper distance measure is a very difficult problem. I'll point in a few different directions in the following paragraphs and hope they can help you on your way.
The following is not the only way to represent this problem... You can see your problem as Bayesian inference over all possible trees with all possible values at the tree nodes. You probably have some prior knowledge about which kinds of trees are more likely than others, and/or which kinds of values are more likely. The Bayesian approach allows you to define priors for both.
One article you might like to read is "Learning with Mixtures of Trees" by Meila and Jordan, 2000 (pdf). It explains that it is possible to use a decomposable prior: the tree structure has a different prior from the values/parameters (this of course means that there is some assumption of independence at play here).
I know you were hinting at heuristics such as the average fan-out etc., but you might find it worthwhile to check out these newer applications of Bayesian inference. Note, for example, that within nonparametric Bayesian methods it is also feasible to reason about infinite trees, as done e.g. by Hutter, 2004 (pdf)!

How do I balance a BK-Tree and is it necessary?

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possible or likely for me to have a balance problem with BK-trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are distinguished by distance, so I can't simply rotate a given node in the tree without recalibrating the entire tree under it. However, if I can find an optimal new root node, this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node, though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly, and is likely to be sub-optimal.)
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
There is a Lisp example in this article: http://cliki.net/bk-tree. Regarding balancing: the data structure and the method seem complicated enough as it is, and the author doesn't say anything about unbalanced trees. If you do experience an unbalanced tree, maybe this structure isn't the right fit for your data?
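For reference, a minimal BK-tree sketch in Python; it assumes you supply a true metric such as Levenshtein edit distance as `dist`:

    class BKTree:
        def __init__(self, dist):
            self.dist = dist   # must satisfy the triangle inequality
            self.root = None   # nodes are (word, {edge_distance: child})

        def add(self, word):
            if self.root is None:
                self.root = (word, {})
                return
            node = self.root
            while True:
                d = self.dist(word, node[0])
                if d == 0:
                    return               # word already present
                if d in node[1]:
                    node = node[1][d]    # descend along the matching edge
                else:
                    node[1][d] = (word, {})
                    return

        def search(self, word, radius):
            # Return all stored words within `radius` of `word`.
            results = []
            stack = [self.root] if self.root else []
            while stack:
                node = stack.pop()
                d = self.dist(word, node[0])
                if d <= radius:
                    results.append(node[0])
                # Triangle inequality: only children whose edge distance
                # lies in [d - radius, d + radius] can contain matches.
                for edge, child in node[1].items():
                    if d - radius <= edge <= d + radius:
                        stack.append(child)
            return results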

Clustering tree structured data

Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a Lisp-like S-expression or a (generalized) algebraic data type in Haskell or OCaml.
We are given a large number of "documents" in the tree structure. Our goal is to cluster documents which are similar. By clustering, we mean a way to divide the documents into j groups, such that the elements within each group are similar to one another.
I am sure there are papers out there which describe approaches, but since I am not well versed in the area of AI/clustering/machine learning, I want to ask someone who is what to look for and where to dig.
My current approach is something like this:
I want to convert each document into an N-dimensional vector suitable for k-means clustering.
To do this, I recursively walk the document tree and calculate a vector at each level. At an inner vertex, I recurse on all subvertices and then sum their vectors. Whenever I recurse, a power factor is applied so that things matter less and less the further down the tree I go. The document's final vector is the one at the root of the tree.
Depending on the data at a tree leaf, I apply a function which maps that data to a vector.
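A minimal sketch of this embedding, assuming some leaf_vector function that maps leaf data to a fixed-size vector, and representing inner vertices as plain lists (both assumptions are just for illustration):

    DECAY = 0.5  # power factor: deeper levels matter less (arbitrary value)

    def embed(node, leaf_vector, dim):
        if not isinstance(node, list):      # leaf: delegate to leaf_vector
            return leaf_vector(node)
        vec = [0.0] * dim
        for child in node:                  # inner vertex: sum the children,
            child_vec = embed(child, leaf_vector, dim)
            vec = [a + DECAY * b for a, b in zip(vec, child_vec)]
        return vec                          # the root's vector is the document's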
But surely, there are better approaches. One weakness of my approach is that it will only cluster together trees whose top structure closely resembles each other's. If the similarity is present but occurs farther down the tree, my approach probably won't work very well.
I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.
Distance function
As suggested, one needs to define a distance function between documents. Without this function, we can't apply a clustering algorithm.
In fact, it may be that the question is about that very distance function and examples thereof. I want documents where elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.
The take-one-step-back viewpoint:
I want to cluster stack traces from programs. These are well-formed tree structures, where the functions close to the root are the inner functions that failed. I need a decent distance function between stack traces, so that traces which probably arise from the same event in the code end up close together.
Given the nature of your problem (stack trace), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.
If string matching is indeed more appropriate for your problem, you can run through your data, map each node onto a hash, and create the n-grams of each 'document'.
Example:
Mapping:
Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3
Doc A: 0-1-2
Doc B: 1-2-3
2-grams for doc A:
X0, 01, 12, 2X
2-grams for doc B:
X1, 12, 23, 3X
Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, the shared bigram 12).
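A minimal sketch of that n-gram construction, with "X" as the padding symbol as in the example:

    def ngrams(sequence, n=2, pad="X"):
        # Pad so the first and last events also get full contexts.
        padded = [pad] * (n - 1) + list(sequence) + [pad] * (n - 1)
        return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

    doc_a = ["0", "1", "2"]
    print(ngrams(doc_a))  # [('X','0'), ('0','1'), ('1','2'), ('2','X')]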
However, if you are still convinced that you need trees, instead of strings, you must consider the following: finding similarities for trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth resulting in a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees for trees that extend it). What you don't want is a data collection containing subtrees that are very rare, or that are present in each document you are processing (which you will get if you do not look for frequent patterns).
Here are some pointers:
http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/
Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
Here you may find a paper that seems closely related to your problem.
From the abstract:
This thesis presents Ixor, a system which collects, stores, and analyzes stack traces in distributed Java systems. When combined with third-party clustering software and adaptive cluster filtering, unusual executions can be identified.
