LightGBM plot_tree() Leaf numbers

What do the numbers in the LightGBM plot_tree output represent? As an example, I used the Pima Indian Diabetes dataset and then called the plot_tree method, which yielded the following:
What do the numbers on the leaf nodes represent?

Do you mean leaf numbers like "leaf 1", "leaf 2", etc.?
They are just labels that represent each leaf output.
To understand it better, I would suggest checking the following URL:
https://machinelearningmastery.com/visualize-gradient-boosting-decision-trees-xgboost-python/

According to the API description, it is the predicted value.
Actually, it is the residual on that leaf.
The residuals of the individual trees are combined to generate the final estimate.
By the way, there are many articles on the gradient boosting decision tree algorithm, but one of the simplest explanations is here.
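To see those leaf values yourself, here is a minimal sketch (using a small synthetic dataset in place of the Pima data, and assuming the graphviz package is installed for plotting):

```python
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Small synthetic binary-classification data standing in for the Pima dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

model = lgb.LGBMClassifier(n_estimators=10, random_state=42)
model.fit(X, y)

# Each leaf is labelled "leaf N: value"; the value is the raw score this leaf
# contributes, and the contributions of all trees are summed to form the final
# prediction (in log-odds space for binary classification).
lgb.plot_tree(model, tree_index=0)
plt.show()

# The same values are available programmatically:
tree_df = model.booster_.trees_to_dataframe()
print(tree_df.loc[tree_df["tree_index"] == 0, ["node_index", "value"]])
```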

Related

Create matching of row and column entries such that values are maximized

Currently I am facing the following optimization problem and I can't seem to find the right algorithm for it. It is related to combinatorial optimization problems such as the knapsack problem, but my mathematical knowledge is limited.
Assume we have the following list of words: ["apple", "banana", "cookie", "donut", "ear", "force"]. Further, assume we have a dataset of texts which, among others, contain these words. At some point I compute a co-frequency matrix, that is, a matrix giving, for each pair of words, the frequency with which the two words occur together across all of the files, e.g. cofreq("apple", "banana") = (number of files which contain both apple and banana) / (total number of files). Therefore cofreq(apple, banana) = cofreq(banana, apple). We ignore cofreq(apple, apple).
Assume we have the following computed matrix (posted as an image, since adding tables seems to be impossible): Table
The goal now is to create unique word pairs such that the summed co-frequencies are maximized and every word has a "partner" (we assume an even number of words). In this example the result would be:
(apple, force) 0.4
(cookie, donut) 0.5
(banana, ear) 0.05
------------------
Total: 0.95
In this case I did it by hand, but I know there is a good algorithm for it; I just can't seem to find it. I was hoping someone could point me in the right direction, e.g. to a research paper.
You need to use a maximum weight matching algorithm to compute this maximal sum pairing.
The table you have as input can be seen as the adjacency matrix of a graph, where the values in the table correspond to the graph's edge weights. You can do this because the cofreq value is symmetric (cofreq(apple, banana) == cofreq(banana, apple)).
The matching algorithm you can use here is called the blossom algorithm. It is not trivial, but very elegant. If you have some experience implementing complex algorithms, you can implement it yourself. Otherwise, there are implementations of it in graph libraries for most common languages.
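For instance, in Python a minimal sketch with networkx (whose max_weight_matching implements the blossom algorithm) might look like this; the co-frequency values below are placeholders chosen only to mirror the example in the question:

```python
import networkx as nx

words = ["apple", "banana", "cookie", "donut", "ear", "force"]

# Placeholder co-frequencies; in practice these come from your computed matrix.
cofreq = {
    ("apple", "force"): 0.4,
    ("cookie", "donut"): 0.5,
    ("banana", "ear"): 0.05,
    ("apple", "banana"): 0.1,
    ("cookie", "ear"): 0.02,
    ("donut", "force"): 0.03,
}

# Build a weighted graph whose adjacency matrix is the co-frequency table.
G = nx.Graph()
G.add_nodes_from(words)
for (u, v), w in cofreq.items():
    G.add_edge(u, v, weight=w)

# maxcardinality=True forces every word to get a partner (even word count);
# among such matchings, the one with maximum total weight is returned.
matching = nx.max_weight_matching(G, maxcardinality=True)
total = sum(G[u][v]["weight"] for u, v in matching)
print(matching, total)  # e.g. {('apple', 'force'), ('cookie', 'donut'), ('banana', 'ear')} 0.95
```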

How is the class center for a decision attribute calculated in class center based fuzzification algorithm?

I came across the class center based fuzzification algorithm on page 16 of this research paper on TRFDT. However, I fail to understand what is happening in step 2 of this algorithm (titled in the paper as Algorithm 2: Fuzzification). If someone could explain it with a small example, that would certainly be helpful.
It is not clear from your question which parts of the article you understand, and IMHO the article is not written in the clearest possible way, so this is going to be a long answer.
Let's start with some intuition behind this article. In short I'd say it is: "let's add fuzziness everywhere to decision trees".
How does a decision tree work? We have a classification problem, and we say that instead of analyzing all attributes of a data point in a holistic way, we analyze them one by one in an order defined by the tree, navigating the tree until we reach some leaf node. The label at that leaf node is our prediction. So the trick is how to build a good tree, i.e. a good order of attributes and good splitting points. This is a well-studied problem, and the idea is to build a tree that encodes as much information as possible by some metric. There are several metrics, and this article uses entropy, which is closely related to the widely used information gain.
The next idea is that we can treat the classification (i.e. the split of the values into classes) as fuzzy rather than exact (aka "crisp"). The idea here is that in many real-life situations not all members of a class are equally representative: some are more "core" examples and some are more "edge" examples. If we can capture this difference, we can provide a better classification.
And finally there is the question of how similar the data points are (in general, or with respect to some subset of attributes), and here we can also have a fuzzy answer (see formulas 6-8).
So the idea of the main algorithm (Algorithm 1) is the same as in the ID3 tree: recursively find the attribute a* that classifies the data best and perform the best split along that attribute. The main difference is in how the information gain for the best-attribute selection is measured (see the heuristic in formulas 20-24), and that, because of the fuzziness, the usual stopping rule of "only one class left" no longer works, so another entropy (the Kosko fuzzy entropy in 25) is used to decide whether it is time to stop.
Given this skeleton of Algorithm 1, there are quite a few parts that you can (or should) select:
How do you measure μ(ai)τ(Cj)(x) used in (20)? This is a measure of how well x represents the class Cj with respect to attribute ai (note that here being not in Cj and far from the points of Cj is also good), with two obvious choices: the lower bounds (16 and 18) and the upper bounds (17 and 19).
How do you measure μRτ(x, y) used in (16-19)? Given that R is induced by ai, this becomes μ(ai)τ(x, y), which is a measure of similarity between two points with respect to attribute ai. Here you can choose one of the metrics (6-8).
How do you measure μCi(y) used in (16-19)? This is a measure of how well the point y fits in the class Ci. If your data already comes as a fuzzy classification, there is nothing to do here. But if your input classification is crisp, then you have to somehow produce μCi(y) from it, and this is what Algorithm 2 does.
There is a trivial solution, μCj(xi) = "1 if xi ∈ Cj and 0 otherwise", but it is not fuzzy at all. The process of building fuzzy data is called "fuzzification". The idea behind Algorithm 2 is that we assume every class Cj is actually some kind of cluster in the space of attributes, so we can derive the degree of membership μCj(xi) from the distance of xi to the center of the cluster cj (the closer we are, the higher the membership should be, so it is really some inverse of a distance). Note that since the distance is measured over the attributes, you should normalize your attributes somehow, or one of them might dominate the distance. And this is exactly what Algorithm 2 does:
it estimates the center of the cluster for class Cj as the center of mass of all the known points in that class, i.e. just the average of all points by each coordinate (attribute);
it calculates the distance from each point xi to each estimated class center cj;
looking at the formula in step #12, it uses the inverse square of the distance as a measure of proximity and then just normalizes the values, because for fuzzy sets Sum[over all Cj](μCj(xi)) should be 1 (see the sketch after this list).
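To make these three steps concrete, here is a minimal sketch of the fuzzification procedure as summarized above (my own illustration, not the paper's exact notation; attributes are assumed to be numeric and already normalized):

```python
import numpy as np

def fuzzify(X, y, eps=1e-12):
    """Class-center fuzzification as summarized above.

    X : (n_samples, n_attributes) array of numeric, already-normalized attributes.
    y : (n_samples,) array of crisp class labels.
    Returns an (n_samples, n_classes) membership matrix whose rows sum to 1.
    """
    classes = np.unique(y)
    # Step 1: class center = mean of the class members, coordinate by coordinate.
    centers = np.array([X[y == c].mean(axis=0) for c in classes])
    # Step 2: squared Euclidean distance from every point to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Step 3: inverse squared distance as proximity, normalized per sample so that
    # the memberships of each sample sum to 1 (eps guards against zero distances).
    prox = 1.0 / (d2 + eps)
    return prox / prox.sum(axis=1, keepdims=True)

# Tiny usage example with two obvious clusters:
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
y = np.array([0, 0, 1, 1])
print(fuzzify(X, y).round(3))
```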

Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted average of the entropies of the potential left and right branches, and the feature + splitting feature_value that gives the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m - 2)/2 tries for each feature at each node (where m is the number of distinct feature_values at the node), is the same as trying ONLY m-1 splits:
sort the m distinct feature_values by the percentage of 1's among the samples within the node that take that feature_value for that feature;
only try the m-1 ways of splitting the sorted list.
This "trying only m-1 splits" method is mentioned as a "shortcut" in the article below, which (by definition of "shortcut") means that the results of the two methods, which differ drastically in runtime, are exactly the same.
The quote: "For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list."
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
Can someone explain why the above process, which requires (2^m - 2)/2 tries for each feature at each node (where m is the number of distinct feature_values at the node), is the same as trying ONLY m-1 splits:
The answer is simple: the two procedures just aren't the same. As you noticed, the exhaustive split search is exponential in the number of categories and thus hardly feasible for any problem in practice. Moreover, due to overfitting, that would usually not be the optimal result in terms of generalization anyway.
Instead, the exhaustive search is replaced by a greedy procedure: sort first, then try all ordered splits. In general this leads to different results than the exact splitting.
To improve on the greedy result, one often additionally applies pruning (which can be seen as another greedy, heuristic method). And newer methods like random forests or BART deal with this problem effectively by averaging over several trees, so that the deviation of a single tree becomes less important.
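For reference, here is a minimal sketch of the ordered-list shortcut described in the quote (my own illustrative implementation, not code from the linked article): sort the categories by the proportion of 1's and evaluate only the m-1 prefix splits.

```python
import numpy as np

def entropy(y):
    """Binary entropy of a 0/1 label array."""
    p = y.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def best_ordered_split(values, y):
    """Try only the m-1 splits of the categories ordered by class probability.

    values : array of categorical feature values at the node
    y      : array of 0/1 labels
    Returns (categories sent to the left branch, entropy drop).
    """
    values, y = np.asarray(values), np.asarray(y)
    cats = np.unique(values)
    # Order categories by the proportion of 1's among samples with that category.
    order = sorted(cats, key=lambda c: y[values == c].mean())
    parent = entropy(y)
    best = (None, -np.inf)
    # Only the m-1 prefix splits of the ordered list are evaluated.
    for k in range(1, len(order)):
        left_cats = set(order[:k])
        mask = np.isin(values, list(left_cats))
        n_left = mask.sum()
        weighted = (n_left * entropy(y[mask]) +
                    (len(y) - n_left) * entropy(y[~mask])) / len(y)
        drop = parent - weighted
        if drop > best[1]:
            best = (left_cats, drop)
    return best

# Toy usage with a categorical feature of 5 distinct values:
vals = np.array(list("aabbbccddde"))
lbls = np.array([0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1])
print(best_ordered_split(vals, lbls))
```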

Similarities Between Trees

I am working on a problem of clustering the results of keyword search on a graph. The results are in the form of trees, and I need to cluster those trees into groups based on their similarities. Every node of a tree has two keys: one is the table name in the SQL database (semantic form) and the second is the actual value of a record of that table (label).
I have used the Zhang and Shasha, Klein, Demaine and RTED algorithms to find the tree edit distance between the trees based on these two keys. All of these algorithms use the number of deletion/insertion/relabel operations needed to transform one tree into the other.
I want some more metrics for checking the similarity between two trees, e.g. number of nodes, average fan-out and so on, so that I can take a weighted average of these metrics and arrive at a good similarity measure which takes into account both the structure of the tree (semantic form) and the information contained in the tree (labels at the nodes).
Can you please point me to an approach or some literature (a good paper) that could help?
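As a point of reference, the Zhang-Shasha edit distance mentioned above can be computed with the third-party zss package (an assumed dependency; the node labels below are purely illustrative):

```python
from zss import Node, simple_distance

# Two toy result trees; each label combines a table name and a record value,
# as in the question (purely illustrative data).
t1 = (Node("author:Smith")
      .addkid(Node("paper:GraphSearch")
              .addkid(Node("venue:VLDB"))))
t2 = (Node("author:Smith")
      .addkid(Node("paper:TreeSearch")
              .addkid(Node("venue:SIGMOD"))))

# Zhang-Shasha edit distance: the number of insert/delete/relabel operations
# needed to turn one tree into the other (here 2, for the two relabels).
print(simple_distance(t1, t2))
```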
Even if you had the (pseudo-)distances between each pair of possible trees, this is actually not what you're after. You actually want to do unsupervised learning (clustering) in which you combine structure learning with parameter learning. The data structures you want to perform inference on are trees. By postulating "some metric space" for your clustering method, you introduce something that is not really necessary. Finding the proper distance measure is a very difficult problem. I'll point in a few different directions in the following paragraphs and hope they help you on your way.
The following is not the only way to represent this problem... You can see your problem as Bayesian inference over all possible trees with all possible values at the tree nodes. You probably have some prior knowledge about which kinds of trees are more likely than others and/or which kinds of values are more likely than others. The Bayesian approach allows you to define priors for both.
One article you might like to read is "Learning with Mixtures of Trees" by Meila and Jordan, 2000 (pdf). It explains that it is possible to use a decomposable prior: the tree structure has a different prior from the values/parameters (this of course means that there is some assumption of independence at play here).
I know you were hinting at heuristics such as the average fan-out etc., but you might find it worthwhile to check out these newer applications of Bayesian inference. Note, for example, that within nonparametric Bayesian methods it is also feasible to reason about infinite trees, as done e.g. by Hutter, 2004 (pdf)!

phylogenetic tree comparison

I developed a new algorithm for phylogenetic tree comparison (a phylogenetic tree is simply a rooted binary tree). As input we have two trees, and we want to calculate their similarity percentage. One example of this type of algorithm is here.
But most of these algorithms (as far as I know, all of them) do not offer a good way to check their accuracy. E.g., if you look at the following figure, you can see that there is more similarity between T1 and T3 than between T1 and T2.
I need a method for checking the accuracy of the similarity measure, to be sure that my algorithm is better than previous algorithms (in most cases it is easy to judge by eye, but I don't know how to extend that to my application).
Your validity measure should be independent of the algorithm itself.
Take a look at "Graph similarity scoring and matching" and "A Method for Comparing Two Hierarchical Clusterings". Maybe they (or the references they cite) will be helpful.
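One common, algorithm-independent sanity check (my suggestion, not from the answer above) is to verify that your similarity score ranks simple tree pairs the same way as an established measure such as the Robinson-Foulds distance. A minimal sketch using the dendropy package (assumed dependency) on toy Newick trees:

```python
import dendropy
from dendropy.calculate import treecompare

# Trees must share one TaxonNamespace to be comparable.
tns = dendropy.TaxonNamespace()
t1 = dendropy.Tree.get(data="((A,B),(C,D));", schema="newick", taxon_namespace=tns)
t2 = dendropy.Tree.get(data="((A,B),(C,D));", schema="newick", taxon_namespace=tns)
t3 = dendropy.Tree.get(data="((A,C),(B,D));", schema="newick", taxon_namespace=tns)
for t in (t1, t2, t3):
    t.encode_bipartitions()

# Robinson-Foulds symmetric difference: 0 for identical topologies, larger for
# more dissimilar trees; a new similarity measure should rank such simple
# pairs consistently with this baseline.
print(treecompare.symmetric_difference(t1, t2))  # expected 0
print(treecompare.symmetric_difference(t1, t3))  # expected > 0
```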
