LightGBM tree structure extraction and interpretation - lightgbm

I'm trying to extract the exact structure of the trees constructed by the LightGBM algorithm. I've used model_dump for this and extracted the structure successfully. The tree looks like this (trained on the iris dataset):
I'm having trouble understanding how to evaluate this tree. Another topic on Stack Overflow says that the leaf value is the raw score before the sigmoid. That seems perfectly logical, but when I put this value through the sigmoid function, the probabilities returned by lightgbm.predict(...) were different from the probabilities calculated from my tree. Is there any postprocessing hidden inside before model_dump that affects those values? What is the correct way to calculate the output probability from those trees?

The iris dataset is a multiclass classification problem, and the tree you are showing is for a single class. Depending on the objective you are using, the postprocessing is not always a sigmoid. For multiclass objectives, you need to apply the softmax function to your raw predictions.
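As a sanity check, here is a minimal sketch (my own example, not from the original post) using LightGBM's scikit-learn interface on the iris data: the per-class raw scores, passed through a softmax, should reproduce predict_proba.

    # Minimal sketch: softmax over per-class raw scores should match predict_proba.
    import numpy as np
    import lightgbm as lgb
    from scipy.special import softmax
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    clf = lgb.LGBMClassifier(objective="multiclass", n_estimators=10)
    clf.fit(X, y)

    raw = clf.predict(X, raw_score=True)   # summed leaf values per class, shape (n_samples, n_classes)
    proba = clf.predict_proba(X)           # LightGBM's own postprocessed probabilities

    print(np.allclose(softmax(raw, axis=1), proba))  # expected: True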

Related

Algorithm: Understand when two line charts are similar

I am trying to develop a script capable of understanding when two line charts are similar (they have a similar direction or similar values).
For instance suppose I have two arrays:
array1 = [0,1,2,3,4,5,6,7,8,9,10];
array2 = [2,3,4,5,6,7,8,8,10,11,12];
As you can see, they both grow and their values are quite similar.
At the moment I have found a perfectly working solution using a DTW algorithm.
The problem is that DTW has a very fast "training part" (I just have to store a lot of line charts) but a heavy prediction part, because it compares the last line chart with all the others in memory.
So my question is: is it possible to move the computational cost to the training part in order to have a faster prediction?
For example, by creating a search tree or something like that?
And if it is possible, according to which specific value can I cluster the information?
Do you have any advice or useful links?
This is often possible by mapping the objects from your domain to a linear space. For example, you can see how that works for word embeddings in natural language (word2vec tutorial, skip to "Visualizing the Learned Embeddings"). In this setting, similarity between objects is defined by a distance in the linear space, which is very fast to compute.
How complex the mapping should be in your case greatly depends on your data: how diverse the charts are and what kind of similarity you wish to capture.
In your example with two vectors, it's possible to compute a single value: the slope of the regression line. This will probably work if your charts are "somewhat linear" in nature. If you'd like to capture sinusoidal patterns as well, you can try to normalize the time series by subtracting the first value. Again, in your particular example this will show a perfect fit.
Bottom line: complexity of the mapping is determined by the complexity of the data.
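For instance, here is a minimal sketch of that one-value mapping (slope of the least-squares regression line); the helper below is my own illustration, not part of the answer.

    import numpy as np

    def slope(values):
        x = np.arange(len(values))
        # np.polyfit with degree 1 returns [slope, intercept]
        return np.polyfit(x, values, 1)[0]

    array1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    array2 = [2, 3, 4, 5, 6, 7, 8, 8, 10, 11, 12]
    print(slope(array1), slope(array2))  # both close to 1.0 -> similar direction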
If they always have the same length, Pearson correlation should be much more appropriate, and much faster.
If you standardize your vectors, Pearson correlation is equivalent to Euclidean distance, and you can use any multidimensional search tree for further acceleration.
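A small sketch of that relation (my own illustration): for z-standardized vectors of length n, the squared Euclidean distance equals 2·n·(1 − Pearson correlation), so a KD-tree over standardized vectors ranks charts by correlation.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.neighbors import KDTree

    def standardize(v):
        v = np.asarray(v, dtype=float)
        return (v - v.mean()) / v.std()

    array1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    array2 = [2, 3, 4, 5, 6, 7, 8, 8, 10, 11, 12]

    z1, z2 = standardize(array1), standardize(array2)
    n = len(z1)
    r, _ = pearsonr(array1, array2)
    print(np.isclose(np.sum((z1 - z2) ** 2), 2 * n * (1 - r)))  # expected: True

    # Fast lookup over many stored charts (all rows must have the same length).
    stored = np.vstack([z1, z2])
    tree = KDTree(stored)
    dist, idx = tree.query(standardize(array1).reshape(1, -1), k=1)
    print(idx)  # index of the nearest stored chart, i.e. the most correlated one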

How to implement decision trees in boosting

I'm implementing AdaBoost (boosting) that will use CART and C4.5. I've read about AdaBoost, but I can't find a good explanation of how to combine AdaBoost with decision trees. Let's say I have a data set D with n examples, and I split D into TR training examples and TE testing examples.
Let's say TR.count = m,
so I set the weights to 1/m, then I use TR to build a tree, test it with TR to find the wrongly classified examples, and test it with TE to calculate the error. Then I change the weights. Now how do I get the next training set? What kind of sampling should I use (with or without replacement)? I know that the new training set should focus more on the samples that were wrongly classified, but how can I achieve this? How will CART or C4.5 know that they should focus on the examples with greater weight?
As far as I know, the TE data set is not meant to be used to estimate the error rate. The raw data can be split into two parts (one for training, the other for cross-validation). There are mainly two methods for applying the weights to the training data distribution; which one to use is determined by the weak learner you choose.
How to apply the weights?
Re-sample the training data set without replacement. This can be viewed as a weighted boosting method. The re-sampled data sets contain the misclassified instances with higher probability than the correctly classified ones, which forces the weak learning algorithm to concentrate on the misclassified data.
Directly use the weights when learning. Such models include Bayesian classification, decision trees (C4.5 and CART), and so on. With respect to C4.5, we calculate the information gain (mutual information) to determine which predictor is selected as the next node, so we can combine the weights with the entropy to compute that measure. For example, view the weights as the probability of each sample in the distribution. Given X = [1, 2, 3, 3] with weights [3/8, 1/16, 3/16, 6/16], the plain entropy of X is -0.25 log(0.25) - 0.25 log(0.25) - 0.5 log(0.5), but with the weights taken into consideration, its weighted entropy is -(3/8) log(3/8) - (1/16) log(1/16) - (9/16) log(9/16). In general, C4.5 can be implemented with weighted entropy, the unweighted case corresponding to weights [1, 1, ..., 1]/N.
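A minimal sketch of the second approach (using scikit-learn's CART as an illustrative stand-in for C4.5; the helper names are my own): the weighted entropy from the toy example above, and an AdaBoost.M1-style loop that passes the weights to the tree via sample_weight instead of re-sampling.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def weighted_entropy(values, weights):
        """Entropy of the weighted value distribution (weights must sum to 1)."""
        values, weights = np.asarray(values), np.asarray(weights)
        probs = np.array([weights[values == v].sum() for v in np.unique(values)])
        return -np.sum(probs * np.log2(probs))

    # Toy example from above: X = [1, 2, 3, 3] with weights [3/8, 1/16, 3/16, 6/16]
    print(weighted_entropy([1, 2, 3, 3], [3/8, 1/16, 3/16, 6/16]))

    def adaboost_m1(X, y, n_rounds=10):
        """Tiny AdaBoost.M1 loop; y must be in {-1, +1}."""
        X, y = np.asarray(X), np.asarray(y)
        m = len(y)
        w = np.full(m, 1.0 / m)                    # initial weights 1/m
        learners, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)       # the tree "sees" the weights here
            pred = stump.predict(X)
            err = np.sum(w * (pred != y)) / np.sum(w)
            if err == 0 or err >= 0.5:
                break
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)         # up-weight misclassified samples
            w /= w.sum()
            learners.append(stump)
            alphas.append(alpha)
        return learners, alphas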
If you want to implement AdaBoost.M1 with the C4.5 algorithm, you should read page 339 of The Elements of Statistical Learning.

Similarities Between Trees

I am working on a problem of clustering the results of keyword search on a graph. The results are in the form of trees, and I need to cluster those trees into groups based on their similarities. Every node of a tree has two keys: one is the table name in the SQL database (semantic form) and the second is the actual values of a record of that table (label).
I have used the Zhang and Shasha, Klein, Demaine, and RTED algorithms to find the tree edit distance between the trees based on these two keys. All of these algorithms use the number of deletion/insertion/relabel operations needed to make the trees identical.
I want some more metrics for checking the similarity between two trees, e.g. number of nodes, average fan-out, and so on, so that I can take a weighted average of these metrics and arrive at a good similarity measure that takes into account both the structure of the tree (semantic form) and the information contained in the tree (the labels at the nodes).
Can you suggest an approach, or some literature or papers that could help?
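As a rough illustration of the weighted combination described above (not a recommendation from the answer below), here is a Python sketch using the zss package for the Zhang and Shasha edit distance; the weights, labels, and helper functions are hypothetical.

    from zss import Node, simple_distance

    def walk(t):
        yield t
        for c in t.children:
            yield from walk(c)

    def node_count(t):
        return sum(1 for _ in walk(t))

    def avg_fanout(t):
        internal = [n for n in walk(t) if n.children]
        return sum(len(n.children) for n in internal) / len(internal) if internal else 0.0

    def combined_distance(a, b, w_edit=0.6, w_size=0.2, w_fanout=0.2):
        # Weighted average of edit distance and simple structural metrics.
        return (w_edit * simple_distance(a, b)
                + w_size * abs(node_count(a) - node_count(b))
                + w_fanout * abs(avg_fanout(a) - avg_fanout(b)))

    # Tiny example: nodes labelled "table:value" as in the question.
    t1 = Node("customer:acme").addkid(Node("order:42")).addkid(Node("order:43"))
    t2 = Node("customer:acme").addkid(Node("order:42"))
    print(combined_distance(t1, t2))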
Even if you had the (pseudo-)distances between each pair of possible trees, this is actually not what you're after. You actually want to do unsupervised learning (clustering) in which you combine structure learning with parameter learning. The types of data structures you want to perform inference on are trees. To postulate "some metric space" for your clustering method, you introduce something that is not really necessary. To find the proper distance measure is a very difficult problem. I'll point in different directions in the following paragraphs and hope they can help you on your way.
The following is not the only way to represent this problem... You can see your problem as Bayesian inference over all possible trees with all possible values at the tree nodes. You probably would have some prior knowledge on what kind of trees are more likely than others and/or what kind of values are more likely than others. The Bayesian approach would allow you to define priors for both.
One article you might like to read is "Learning with Mixtures of Trees" by Meila and Jordan, 2000 (pdf). It explains that it is possible to use a decomposable prior: the tree structure has a different prior from the values/parameters (this of course means that there is some assumption of independence at play here).
I know you were hinting at heuristics such as the average fan-out etc., but you might find it worthwhile to check out these newer applications of Bayesian inference. Note, for example, that within nonparametric Bayesian methods it is also feasible to reason about infinite trees, as done e.g. by Hutter, 2004 (pdf)!

Confidence of categorical predictions in MATLAB to generate lift charts

For a homework assignment I need to fit several classification models to a data set and compare their lift charts to determine the most effective model. The models produce a binary result (or a probability of that binary result); let's call the classes YES and NO. Lift charts are easy to generate for models with continuous output, as it's easy to order the data set in descending order of confidence.
I am having trouble doing that with models that generate a binary result, for example k-NN and ClassificationTree. In my head I know methods to create a confidence value, but I don't know how to do it with these libraries.
For the classification tree I would set the confidence to the proportion of YES among the training records that fall through a particular path of the tree. However, with this method and the tree model in MATLAB, I don't know which tree path each record falls through.
Similarly, with k-NN I would take the probability of a YES among the k neighbors, but the model doesn't tell me the k neighbors and I'd prefer not to search for them myself.
Any help with one or both of these problems (or a better way of producing lift charts in MATLAB) is greatly appreciated.
I was actually able to find the answer to my own question. The predict function in MATLAB produces scores for the probability of each class in the prediction model:
[class, score] = predict(mdl, new_observation);
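% the columns of score follow the order of mdl.ClassNames; sort the test set by the YES column in descending order to build the lift chart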

Clustering tree structured data

Suppose we are given data in a semi-structured format as a tree. As an example, the tree could be a valid XML document or a valid JSON document. You could imagine it being a lisp-like S-expression or a (generalized) algebraic data type in Haskell or OCaml.
We are given a large number of "documents" in this tree structure. Our goal is to cluster documents which are similar. By clustering, we mean a way to divide the documents into j groups, such that the elements in each group are similar to each other.
I am sure there are papers out there which describe approaches, but since I am not very well versed in the area of AI/clustering/machine learning, I want to ask somebody who is what to look for and where to dig.
My current approach is something like this:
I want to convert each document into an N-dimensional vector suitable for K-means clustering.
To do this, I recursively walk the document tree and calculate a vector for each level. If I am at a tree vertex, I recurse on all sub-vertices and then sum their vectors. Also, whenever I recurse, a power factor is applied so that contributions matter less and less the further down the tree I go. The document's final vector is the vector at the root of the tree.
Depending on the data at a tree leaf, I apply a function which maps that data to a vector.
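A rough sketch of that recursive embedding (my own illustration; the hashing trick, dimension, and decay value are arbitrary choices):

    import hashlib
    import numpy as np

    DIM = 64     # embedding dimension
    DECAY = 0.5  # "power factor": deeper levels contribute less

    def leaf_vector(value, dim=DIM):
        # Map arbitrary node/leaf data to a fixed-size vector via a seeded hash.
        seed = int(hashlib.md5(str(value).encode()).hexdigest()[:8], 16)
        return np.random.default_rng(seed).normal(size=dim)

    def tree_vector(node, dim=DIM, decay=DECAY):
        """node is (label, children) with children a list of nodes, or a plain leaf value."""
        if isinstance(node, tuple):
            label, children = node
            vec = leaf_vector(label, dim)
            for child in children:
                vec += decay * tree_vector(child, dim, decay)
            return vec
        return leaf_vector(node, dim)

    # Example "document"; the resulting vectors can be fed to k-means.
    doc = ("root", [("a", [1, 2]), ("b", [3])])
    print(tree_vector(doc).shape)  # (64,)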
But surely, there are better approaches. One weakness of my approach is that it will only similarity-cluster trees whose top structure is much like each other. If the similarity is present but occurs farther down the tree, then my approach probably won't work very well.
I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.
Distance function
As suggested, one needs to define a distance function between documents. Without this function, we can't apply a clustering algorithm.
In fact, it may be that the question is really about that very distance function and examples thereof. I want documents whose elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.
The take-one-step-back viewpoint:
I want to cluster stack traces from programs. These are well-formed tree structures, where the functions close to the root are the innermost functions, the ones that fail. I need a decent distance function between stack traces that likely occur because the same event happened in the code.
Given the nature of your problem (stack trace), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.
If string matching would indeed be more appropriate for your problem, you can run through your data, map each node onto a hash and create for each 'document' its n-grams.
Example:
Mapping:
Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3
Doc A: 0-1-2
Doc B: 1-2-3
2-grams for doc A:
X0, 01, 12, 2X
2-grams for doc B:
X1, 12, 23, 3X
Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, the shared 2-gram 12).
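A minimal sketch of that idea (my own illustration): map frames to ids, pad the sequence, build 2-grams, and compare documents with Jaccard similarity.

    def ngrams(sequence, n=2, pad="X"):
        padded = [pad] + list(sequence) + [pad]
        return {tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    mapping = {"Exception A": "0", "Exception B": "1", "Exception C": "2", "Exception D": "3"}
    doc_a = [mapping[e] for e in ["Exception A", "Exception B", "Exception C"]]  # 0-1-2
    doc_b = [mapping[e] for e in ["Exception B", "Exception C", "Exception D"]]  # 1-2-3

    print(jaccard(ngrams(doc_a), ngrams(doc_b)))  # the docs share only the 2-gram ("1", "2")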
However, if you are still convinced that you need trees instead of strings, you must consider the following: finding similarities between trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth resulting in a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees of the trees that extend them). What you don't want is a data collection containing subtrees that are very rare, or that are present in every document you are processing (which is what you will get if you do not look for frequent patterns).
Here are some pointers:
http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/
Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
Here you may find a paper that seems closely related to your problem.
From the abstract:
This thesis presents Ixor, a system which collects, stores, and analyzes stack traces in distributed Java systems. When combined with third-party clustering software and adaptive cluster filtering, unusual executions can be identified.
