Building a Decision Tree based on a sparse, multi-value matrix - algorithm

Decision tree learning algorithms have a general advantage over, e.g., NNs/RNNs: their internal structure is humanly comprehensible, can be sanity-checked, and can even improve human decision making. Algorithms like ID3 are excellent for building toy models even from non-toy-scale data.
For a recent project of mine there is a large existing database, which contains the outcome column for all rows; however, there are two extra complexities:
Not all column values are filled for every training row - the matrix is sparse, and
Values aren't binary - there are 2-5 (up to ~7) distinct values for many columns.
I'd like to automatically train and learn the closest-matching decision tree given the data above, and so I turn to the nice community of Stack Overflow:
What specific algorithm can I use to extract a decision tree, given the constraints above?
Edit note: insta-accept for naming the algorithm, giving a short description of how it works, and linking to an OSS implementation (in any language) thereof. Thanks!
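A minimal sketch of one possible approach, assuming scikit-learn (CART) with imputation for the sparse cells and one-hot encoding for the multi-valued columns; the column names and toy data below are invented purely for illustration:

    # Hedged sketch, not a definitive answer: CART via scikit-learn, with
    # most-frequent imputation for the sparse cells and one-hot encoding for
    # multi-valued categorical columns. All names and data here are invented.
    import numpy as np
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy stand-in for the large existing database (np.nan marks missing cells).
    df = pd.DataFrame({
        "color":   ["red", "blue", np.nan, "green", "red", np.nan],
        "size":    ["S",   "M",    "L",    np.nan,  "XL",  "M"],
        "outcome": ["yes", "no",   "yes",  "no",    "yes", "no"],
    })
    X, y = df.drop(columns="outcome"), df["outcome"]

    prep = ColumnTransformer([
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),  # fill sparse cells
            ("onehot", OneHotEncoder(handle_unknown="ignore")),   # 2-7 values per column
        ]), ["color", "size"]),
    ])

    model = Pipeline([
        ("prep", prep),
        ("tree", DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)),
    ])
    model.fit(X, y)
    print(export_text(model.named_steps["tree"]))  # the tree stays human-readable

Alternatively, C4.5 handles multi-valued attributes and missing values natively, and its open-source implementation is J48 in Weka, which is probably the closer fit to the constraints above.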

Related

Human-interpretable supervised machine learning algorithm

I'm looking for a supervised machine learning algorithm that would produce transparent rules or definitions that can be easily interpreted by a human.
Most algorithms that I work with (SVMs, random forests, PLS-DA) are not very transparent. That is, you can hardly summarize the models in a table in a publication aimed at a non-computer scientist audience. What authors usually do is, for example, publish a list of variables that are important based on some criterion (for example, Gini index or mean decrease of accuracy in the case of RF), and sometimes improve this list by indicating how these variables differ between the classes in question.
What I am looking for is a relatively simple output of the style "if (any of the variables V1-V10 > median or any of the variables V11-V20 < 1st quartile) and variable V21-V30 > 3rd quartile, then class A".
Is there any such thing around?
Just to constrain my question a bit: I am working with highly multidimensional data sets (tens of thousands to hundreds of thousands of often collinear variables). So, for example, regression trees would not be a good idea (I think).
You sound like you are describing decision trees. Why would regression trees not be a good choice? Maybe not optimal, but they work, and those are the most directly interpretable models. Anything that works on continuous values works on ordinal values.
There's a tension between wanting an accurate classifier, and wanting a simple and explainable model. You could build a random decision forest model, and constrain it in several ways to make it more interpretable:
Small max depth
High minimum information gain
Prune the tree
Only train on "understandable" features
Quantize/round decision thresholds
The model won't necessarily be as good, but it will be much easier to read; a rough sketch of these constraints is shown below.
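A minimal sketch of those constraints, assuming scikit-learn (the dataset and parameter values are arbitrary placeholders):

    # Hedged sketch: a deliberately small, readable tree. Parameter values are
    # arbitrary; tune them for your own accuracy/interpretability trade-off.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    tree = DecisionTreeClassifier(
        criterion="entropy",
        max_depth=3,                  # small max depth
        min_impurity_decrease=0.01,   # roughly "high minimum information gain"
        min_samples_leaf=10,          # crude stand-in for pruning
        random_state=0,
    ).fit(data.data, data.target)

    # The whole model fits in a handful of printable "if ... then ..." rules:
    print(export_text(tree, feature_names=list(data.feature_names)))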
You can find interesting research on interpretable AI methods by Been Kim at Google Brain.

Literature on many-vs-many classifier

In the context of the Multi-Class Classification (MCC) problem, a common approach is to build the final solution from multiple binary classifiers. The two composition strategies typically mentioned are one-vs-all and one-vs-one. To distinguish the approaches, it is clearer to look at what each binary classifier attempts to do. One-vs-all's primitive classifier attempts to separate just one class from the rest, whereas one-vs-one's primitive attempts to separate one class from one other class. One-vs-one is also, quite confusingly, called all-vs-all and all-pairs.
I want to investigate the rather simple idea of building an MCC classifier by composing binary classifiers in a binary-decision-tree-like manner.
For an illustrative example:
             has wings?
            /          \
        quack?        nyan?
        /    \        /    \
     duck   bird    cat    dog
As you can see, the "has wings?" node does a 2-vs-2 classification, so I am calling the approach many-vs-many.
The problem is, I don't know where to start reading.
Is there a good paper you would recommend?
To give a bit more context, I'm considering using a multilevel evolutionary algorithm (MLEA) to build the tree, so if there is an even more direct answer, it would be most welcome.
Edit: For more context (and perhaps you might find it useful), I read this paper, which is one of the GECCO 2011 best paper winners; it uses an MLEA to compose an MCC classifier in a one-vs-all manner. This is what inspired me to look for a way to modify it into a decision tree builder.
What you want looks very much like Decision Trees.
From wiki:
Decision tree learning, used in statistics, data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
Sailesh's answer is correct in that what you intend to build is a decision tree. There are many algorithms for learning such trees, e.g. Random Forests (ensembles of trees). You could try Weka and see what is available there.
If you're more interested in evolutionary algorithms, I want to mention Genetic Programming. You can try for example our implementation in HeuristicLab. It can deal with numeric classes and attempts to find a formula (tree) that maps each row to its respective class using e.g. mean squared error (MSE) as fitness function.
There are also instance-based classification methods like nearest neighbor, and kernel-based methods like support vector machines. Instance-based methods support multiple classes natively, but with kernel methods you have to use one of the approaches you mentioned.
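As a supplement to the answers above: a minimal sketch of the many-vs-many tree from the question, assuming scikit-learn and hand-picked class partitions at each node (an MLEA, or any other search, would be choosing those partitions instead; the class DichotomyNode is invented for illustration). This construction is usually called a nested dichotomy in the literature, which may be a useful search term.

    # Hedged sketch of a "many-vs-many" tree of binary classifiers (a nested
    # dichotomy). Class partitions are fixed by hand here, not searched for.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class DichotomyNode:
        def __init__(self, left_classes, right_classes, left=None, right=None):
            self.left_classes = set(left_classes)    # classes routed to the left
            self.right_classes = set(right_classes)  # classes routed to the right
            self.left, self.right = left, right      # child nodes (None = leaf side)
            self.clf = LogisticRegression(max_iter=1000)

        def fit(self, X, y):
            mask = np.isin(y, list(self.left_classes | self.right_classes))
            side = np.isin(y[mask], list(self.left_classes)).astype(int)  # 1 = left
            self.clf.fit(X[mask], side)
            for child in (self.left, self.right):
                if child is not None:
                    child.fit(X, y)
            return self

        def predict_one(self, x):
            go_left = self.clf.predict(x.reshape(1, -1))[0] == 1
            child = self.left if go_left else self.right
            classes = self.left_classes if go_left else self.right_classes
            return child.predict_one(x) if child else next(iter(classes))

    # "has wings?" separates {duck, bird} from {cat, dog}; each side then gets
    # its own binary classifier ("quack?" and "nyan?"). With integer labels:
    # root = DichotomyNode({0, 1}, {2, 3},
    #                      left=DichotomyNode({0}, {1}),    # quack?
    #                      right=DichotomyNode({2}, {3}))   # nyan?
    # root.fit(X_train, y_train); root.predict_one(X_test[0])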

Finding an optimum learning rule for an ANN

How do you find an optimum learning rule for a given problem, say a multiple category classification?
I was thinking of using Genetic Algorithms, but I know there are issues surrounding performance. I am looking for real world examples where you have not used the textbook learning rules, and how you found those learning rules.
Nice question BTW.
Classification algorithms can be characterized along several dimensions, for example:
What the algorithm strongly prefers (i.e. what type of data is most suitable for it).
Training overhead (does it take a lot of time to be trained?).
When it is effective (large, medium, or small amounts of data).
The complexity of the analyses it can deliver.
Therefore, for your problem of classifying multiple categories I would use online logistic regression (trained via SGD), because it is well suited to small-to-medium data sizes (less than tens of millions of training examples) and it is really fast.
Another Example:
Let's say that you have to classify a large amount of text data. Then Naive Bayes is your baby, because it strongly prefers text analysis. SVM and SGD are faster and, in my experience, easier to train, but those two are best applied when the data size is medium or small rather than large.
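A minimal sketch of both suggestions, assuming scikit-learn; the tiny corpus is invented, and the SGD loss name varies between scikit-learn versions:

    # Hedged sketch: MultinomialNB as a text baseline, and SGDClassifier with a
    # logistic loss as "online logistic regression". Corpus is invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["cheap pills online", "meeting at noon", "win money now",
             "lunch tomorrow?", "free offer click here", "project status update"]
    labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

    nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
    sgd = make_pipeline(CountVectorizer(),
                        SGDClassifier(loss="log_loss",  # "log" in older versions
                                      random_state=0)).fit(texts, labels)

    print(nb.predict(["free money pills"]))        # Naive Bayes text baseline
    print(sgd.predict(["status of the project"]))  # online logistic regression via SGD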
In general, any data mining practitioner will ask themselves the four aforementioned questions before starting any ML or simple mining project.
After that you have to measure the AUC, or any other relevant metric, to see how well you have done, because you might use more than one classifier in a project, and sometimes the classifier you thought was perfect turns out to perform poorly under some evaluation; then you go back over these questions to find where you went wrong.
Hope that I helped.
When you input a vector x to the net, the net will give an output that depends on all the weights (the vector w). There will be an error between the output and the true answer. The average error e is a function of w, say e = F(w). Suppose you have a one-layer network with two weights; then F(w) can be pictured as a surface over the two-dimensional weight space.
When we talk about training, we are actually talking about finding the w which makes e minimal. In other words, we are searching for the minimum of a function. To train is to search.
So your question is how to choose the search method. My suggestion: it depends on what the surface of F(w) looks like. The wavier it is, the more randomized the method should be, because a simple method based on gradient descent has a bigger chance of leaving you trapped in a local minimum, so you lose the chance to find the global minimum. On the other hand, if the surface of F(w) looks like one big pit, then forget the genetic algorithm; simple backpropagation, or anything based on gradient descent, would be very good in that case.
You may ask how you can know what the surface looks like. That is a matter of experience. Or you might want to randomly sample some w and calculate F(w) to get an intuitive view of the surface.
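A minimal illustration of that last suggestion, assuming NumPy and an invented two-weight, one-layer network with squared error:

    # Hedged sketch: randomly sample weight vectors w and evaluate the average
    # error F(w) to get a feel for how rugged the surface is before choosing
    # between gradient descent and a more randomized search such as a GA.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))               # toy inputs for a 2-weight network
    y = np.tanh(X @ np.array([1.5, -0.7]))      # invented "true" targets

    def F(w):
        """Average squared error of the one-layer net with weights w."""
        return np.mean((np.tanh(X @ w) - y) ** 2)

    samples = rng.uniform(-3.0, 3.0, size=(1000, 2))  # random points in weight space
    errors = np.array([F(w) for w in samples])

    print("min sampled F(w):", errors.min())
    print("spread of F(w):  ", errors.std())
    # Many widely separated near-minima suggest a wavy surface (favour randomized
    # search); one smooth basin favours plain gradient descent / backpropagation.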

Association mining with large number of small datasets

I have a large number (100-150) of small (approx 1 kbyte) datasets.
We will call these the 'good' datasets.
I also have a similar number of 'bad' datasets.
Now I'm looking for software (or perhaps algorithm(s)) to find rules for what constitutes a 'good' dataset versus a 'bad' dataset.
The important thing here is the software's ability to deal with the multiple datasets rather than just one large one.
Help much appreciated.
Paul.
It seems like a classification problem. If you have many datasets labelled as "good" or "bad" you can train a classifier to predict if a new dataset is good or bad.
Algorithms such as decision trees, k-nearest neighbors, SVMs and neural networks are potential tools that you could use.
However, you need to determine which attributes you will use to train the classifier.
One common way to do it is using k-nearest neighbors.
Extract fields from your datasets; for example, if a dataset is text, a common way to extract fields is using a bag of words.
Store the "training set", and when a new, unlabeled dataset arrives, find the k nearest neighbors to it (according to the extracted fields). Label the new dataset like the majority of its k nearest neighbors from the training set.
Another common method is using a decision tree. The risk with decision trees is making the decisions too specific (overfitting). An existing algorithm which you might use to create a good (heuristic) tree is ID3.
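A minimal sketch of both suggestions, assuming scikit-learn and that each small dataset can be flattened into a single text blob (the "datasets" below are invented stand-ins):

    # Hedged sketch: bag-of-words features per dataset, then k-nearest neighbors
    # and an entropy-based (ID3-like) decision tree as two candidate classifiers.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    datasets = ["alpha 1 2 ok ok", "beta 9 9 fail", "alpha 3 ok",
                "gamma fail fail", "alpha ok 2 2", "beta fail 8"]
    labels = ["good", "bad", "good", "bad", "good", "bad"]

    knn = make_pipeline(CountVectorizer(),
                        KNeighborsClassifier(n_neighbors=3)).fit(datasets, labels)
    tree = make_pipeline(CountVectorizer(),
                         DecisionTreeClassifier(criterion="entropy",
                                                max_depth=3)).fit(datasets, labels)

    new = ["alpha 5 ok"]                        # a freshly arrived, unlabeled dataset
    print(knn.predict(new), tree.predict(new))  # label it with each classifier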

Help Understanding Cross Validation and Decision Trees

I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially, Cross Validation allows you to alternate between training and testing when your dataset is relatively small, in order to make the most of your data when estimating error. A very simple algorithm goes something like this:
1. Decide on the number of folds you want (k).
2. Subdivide your dataset into k folds.
3. Use k-1 folds as a training set to build a tree.
4. Use the remaining fold as a test set to estimate statistics about the error of your tree.
5. Save your results for later.
6. Repeat steps 3-5 k times, leaving out a different fold for your test set each time.
7. Average the errors across your iterations to predict the overall error.
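For concreteness, a minimal sketch of those steps, assuming scikit-learn (iris is just a placeholder dataset):

    # Hedged sketch of the steps above: choose k, split into folds, train a tree
    # on k-1 folds, measure error on the held-out fold, then average the errors.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    k = 5
    errors = []

    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[train_idx], y[train_idx])                      # build on k-1 folds
        errors.append(1 - tree.score(X[test_idx], y[test_idx]))   # error on held-out fold

    print("estimated generalization error:", np.mean(errors))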
The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick? One idea I had was to pick the one with minimal error (although that doesn't make it optimal, just that it performed best on the fold it was given - maybe using stratification will help, but everything I've read says it only helps a little bit).
As I understand cross validation, the point is to compute in-node statistics that can later be used for pruning. So really each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're averaging your error, how do you merge these stats within each node across k trees when each tree could vary in what it chooses to split on, etc.?
What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.
Any help with this little wrinkle would be much appreciated.
The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick?
The purpose of cross validation is not to help select a particular instance of the classifier (or decision tree, or whatever automatic learning application) but rather to qualify the model, i.e. to provide metrics such as the average error ratio, the deviation relative to this average etc. which can be useful in asserting the level of precision one can expect from the application. One of the things cross validation can help assert is whether the training data is big enough.
With regards to selecting a particular tree, you should instead run yet another training on 100% of the training data available, as this typically will produce a better tree. (The downside of the Cross Validation approach is that we need to divide the [typically little] amount of training data into "folds" and as you hint in the question this can lead to trees which are either overfit or underfit for particular data instances).
In the case of decision tree, I'm not sure what your reference to statistics gathered in the node and used to prune the tree pertains to. Maybe a particular use of cross-validation related techniques?...
For the first part, and like the others have pointed out, we usually use the entire dataset for building the final model, but we use cross-validation (CV) to get a better estimate of the generalization error on new unseen data.
For the second part, I think you are confusing CV with the validation set, used to avoid overfitting the tree by pruning a node when some function value computed on the validation set does not increase before/after the split.
Cross validation isn't used for building/pruning the decision tree. It's used to estimate how well the tree (built on all of the data) will perform by simulating the arrival of new data (by building the tree without some elements, just as you wrote). It doesn't really make sense to pick one of the trees generated by it, because the model is constrained by the data you have (and not using all of it might actually be worse when you use the tree on new data).
The tree is built over the data that you choose (usually all of it). Pruning is usually done using some heuristic (e.g. 90% of the elements in the node belong to class A so we don't go any further, or the information gain is too small).
The main point of using cross-validation is that it gives you a better estimate of the performance of your trained model when used on different data.
Which tree do you pick? One option would be to build a new tree using all your data as the training set.
It has been mentioned already that the purpose of cross-validation is to qualify the model. In other words, cross-validation provides us with an error/accuracy estimate for a model generated with the selected "parameters", regardless of the particular data split used.
The cross-validation process can be repeated using different parameters until we are satisfied with the performance. Then we can train the model with the best parameters on the whole data.
I am currently facing the same problem, and I think there is no “correct” answer, since the concepts are contradictory and it’s a trade-off between model robustness and model interpretation.
I basically chose the decision tree algorithm for the sake of easy interpretability, visualization and straightforward hands-on application.
On the other hand, I want to prove the robustness of the model using cross-validation.
I think I will apply a two-step approach:
1. Apply k-fold cross-validation to show the robustness of the algorithm with this dataset.
2. Use the whole dataset for the final decision tree, for interpretable results.
You could also randomly choose a tree from the cross-validation, or the best-performing tree, but then you would lose the information from the hold-out set.
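A minimal sketch of that two-step approach, assuming scikit-learn (iris and the max_depth value are placeholders):

    # Hedged sketch: step 1 uses k-fold CV as a robustness check, step 2 fits
    # the final, interpretable tree on the whole dataset.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)

    scores = cross_val_score(tree, data.data, data.target, cv=10)           # step 1
    print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

    final_tree = tree.fit(data.data, data.target)                           # step 2
    print(export_text(final_tree, feature_names=list(data.feature_names)))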
