SHAP summary plot for LightGBM - lightgbm

How do I interpret the SHAP summary plot below for each class? I have checked the document below for an explanation, but it is still not clear to me. Please explain.
https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Census%20income%20classification%20with%20LightGBM.html

A late answer, but for an LGBM classifier, the shap_values obtained from shap.TreeExplainer() are a list of length equal to the number of classes. So for a binary case, it's a list of 2 arrays, where one array is the negative of the other (as expected). As a result, plotting the list as is does not provide much information, since the bars for each class of a given feature are equal in length.
To get more information from the shap summary plot, use the index associated with your class of interest (e.g., 1 for positive class).
Code example for binary classification -
import shap

explainer = shap.TreeExplainer(lgbm_model)
shap_values = explainer.shap_values(x_test)  # list with one array per class
len(shap_values) == 2                        # True for binary classification
shap.summary_plot(shap_values[1], x_test)    # beeswarm for the positive class

'Relationship' is the most important variable for classifying both classes, since its bar is the widest for both blue and red. For any variable, say Age, if the blue portion is larger than the red, the variable matters more for identifying class 0 than class 1. In your plot, no variable appears to show that property.
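If you do want the combined bar plot described above (one bar per feature, split by class), here is a minimal sketch, assuming the same lgbm_model and x_test as in the earlier snippet; the class names passed in are placeholders:
import shap

explainer = shap.TreeExplainer(lgbm_model)
shap_values = explainer.shap_values(x_test)

# Passing the full per-class list draws the stacked bar plot (one segment per class);
# passing a single element, e.g. shap_values[1], draws the beeswarm for that class only.
shap.summary_plot(shap_values, x_test, class_names=["class 0", "class 1"])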

Related

How to determine which label is considered the 'positive' class in H2O binary classifier?

I am training a binary classifier using h2o.ai and would like to know which label is considered the 'positive' class. This makes a difference: if the labels are, say, 'give cookie' and 'don't give cookie', and I am optimizing to maximize recall, then depending on which label is the 'positive' class we will be giving out more cookies ('give cookie' as positive class) or fewer ('don't give cookie' as positive class).
Another post on SO (How do I specify the positive class in an H2O random forest or other binary classifier?) seems to imply that level values are assigned in alphabetical order by default ('a' being the lowest level and 'z' the highest), and I am just trying to confirm that here as its own explicit question.
Also, is there a way to see which class is currently the 'positive' class for a model (i.e. based on the ordering of the confusion matrix labels in the some_h20_model.confusion_matrix(...) output)?
What you are verifying is correct, H2O-3 orders levels in lexicographical order.
You can use this confusion matrix as an example of how a confusion matrix will be ordered (i.e. if you have categoricals and you sort them in alphabetical order, they will map to 0, 1, 2, ... as shown).
And here is an example of a binary outcome with No and Yes, where No maps to 0 and Yes maps to 1.
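As an illustration, here is a minimal Python sketch (not from the original answer) of checking the level order and confusion-matrix layout; the file name and column names are hypothetical:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("events.csv")          # hypothetical dataset
df["outcome"] = df["outcome"].asfactor()    # hypothetical label column

# Levels are stored in lexicographical order; the second level is treated
# as the "positive" class in binomial metrics.
print(df["outcome"].levels())               # e.g. [['dont give cookie', 'give cookie']]

model = H2OGradientBoostingEstimator()
model.train(x=[c for c in df.columns if c != "outcome"], y="outcome", training_frame=df)

# The confusion matrix rows/columns follow the same level order.
print(model.confusion_matrix())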

Doc2vec - About getting document vector

I'm a very new student of doc2vec and have some questions about document vector.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So far, using a pre-trained doc2vec model, I have tried the code below:
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]
When I tried this code, I could get a vector for the single word 'cat', but not for 'cat-like mammal'.
That's because word2vec only provides vectors for single words like 'cat', right? (If I'm wrong, please correct me.)
So I've searched and found infer_vector() and tried the code below
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
When I tried this code, I could get a vector, but I get a different value every time I call
phrase_vec = m.infer_vector(phrase)
presumably because infer_vector has a 'steps' parameter. When I set steps=0, I always get the same vector:
phrase_vec = m.infer_vector(phrase, steps=0)
However, I have also read that a document vector is obtained by averaging the word vectors in the document.
For example, if the document is the three words 'cat-like mammal', you add the vectors for 'cat', 'like', and 'mammal' and then average them, and that would be the document vector. (If I'm wrong, please correct me.)
So here are some questions.
Is using infer_vector() with 0 steps the right way to get a vector for a phrase?
If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can see also an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.
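A minimal sketch of the above, assuming a model saved with gensim's Doc2Vec and the older gensim 3.x API (where the parameter is called steps and the learned vectors live in docvecs; newer releases rename these to epochs and dv):
import numpy as np
from gensim.models.doc2vec import Doc2Vec

m = Doc2Vec.load("path/pre-trained doc2vec model.bin")
tokens = "cat like mammal".lower().split()

# Two inference runs differ slightly because of random initialization,
# but with enough steps they should be very close to each other.
v1 = m.infer_vector(tokens, steps=50)
v2 = m.infer_vector(tokens, steps=50)
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # typically near 1.0

# docvecs holds the vectors learned during training, keyed by tag.
print(m.docvecs.most_similar([v1], topn=3))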

Stanford Classifier: What are non ngram activeFeatures used to determine scoreOf Datum?

I have a number of classifiers to determine whether event descriptions fall into certain categories (a rock concert, a jazz evening, classical music, etc.) or not. I have created a servlet which uses the LinearClassifier scoresOf function to return a score for the event description's datum.
In order to look at cases which return unexpected results, I adapted the scoresOf function (public Counter scoresOf(Datum example)) to get an array of the individual features and their scores, so I could understand how the final score was arrived at. This works for the most part, i.e. I mostly have lines like:
1-#-jazz -0.6317620789568879
1-#-saxo -0.2449097451977173
as I'd expect. However, I also have a couple that I don't understand:
CLASS 1.4064007882810108
1-Len-31-Inf 0.4569598446321162
Can anybody please help by explaining what these are and how these scores are determined? (I really thought I was just working on a score built up from the weighted components of my description string).
(I appreciate that "CLASS" & "Len-xx" are set as properties for the classifier, I just don't understand why they then show up as scored elements in their own right)
For seeing feature weights, you might also look at LinearClassifier's justificationOf(); I think it does the same as what you've been writing.
For the questions:
The CLASS feature acts as a class prior or bias term. It will have a more positive weight to the extent that the class is more common in the data overall. You will get this feature iff you use the useClassFeature property. But it's generally a good idea to have it.
The 1-Len feature looks at the length of the String that is column 1; 31-Inf means a length over 30. This will again have weights reflecting whether such a length is indicative of a particular class. It is employed iff you use the binnedLengths property. This is useful only if there is some general correlation between field length and the target class.
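For reference, a minimal sketch of a properties file that would switch on both of these feature types; the n-gram settings are assumptions (they produce features like 1-#-jazz) and are not taken from the question:
useClassFeature=true
1.useNGrams=true
1.usePrefixSuffixNGrams=true
1.binnedLengths=10,20,30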

Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?

Data Structure:
User has many Profiles
(Limit - no more than one of each profile type per user, no duplicates)
Profile has many Attribute Values
(A user can have as many or few attribute values as they like)
Attributes belong to a category
(No overlap. This controls which attribute values a profile can have)
Example/Context:
I believe with stack exchange you can have many profiles for one user, as they differ per exchange site? In this problem:
Profile: Video, so Video profile only contains Attributes of Video category
Attributes, so an Attribute in the Video category may be Genre
Attribute Values, e.g. Comedy, Action, Thriller are all Attribute Values
Profiles and Attributes are just ways of grouping Attribute Values on two levels.
Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.
Problem:
Give each user a similarity rating against each other user.
Similarity based on All Attribute Values associated with the user.
1. Flat/one level
Unequal number of attribute values between two users
Attribute value can only be selected once per user, so no duplicates
Therefore, binary string/boolean array with Cosine Similarity?
2. 1 + Weight Profiles
Give each profile a weight (totaling 1?)
Work out profile similarity, then multiply by weight, and sum?
3. 1 + Weight Attribute Categories and Profiles
As an attribute belongs to a category, categories can be weighted
Similarity per category, weighted sum, then same by profile?
Or merge profile and category weights
4. 3 + Distance between every attribute value
Table of similarity distance for every possible value vs value
Rather than similarity by value === value
'Close' attributes contribute to overall similarity.
No idea how to do this one
Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.
Thanks!
First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would try to omit it as much as possible. This means that it's highly preferable to keep direct relation between users and attributes. So, basically your users should be represented as vectors, where each variable (vector element) represents single attribute.
If you choose such representation, make sure all attributes make sense and have appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algos) will treat them incorrectly (e.g. multiply thriller, represented as 2, with comedy, represented as 5, which makes no sense actually).
It's ok to use distance between attributes when applicable, though I can hardly come up with an example in your setting.
At this point you should stop reading and try it out: simple representation of users as vector of attributes and cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
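As a concrete starting point, here is a minimal sketch of that baseline; the attribute list and user selections are made up for illustration:
import numpy as np

all_attributes = ["comedy", "action", "thriller", "jazz", "rock"]   # hypothetical attribute values

def user_vector(selected):
    # binary vector: 1.0 if the user selected the attribute value, else 0.0
    return np.array([1.0 if a in selected else 0.0 for a in all_attributes])

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

u1 = user_vector({"comedy", "jazz"})
u2 = user_vector({"comedy", "thriller", "rock"})
print(cosine(u1, u2))   # similarity between the two users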
And if the model performs badly, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make things worse? Or should some attributes really have larger importance than others? Depending on these questions, you may want to:
Run feature selection to avoid noisy variables.
Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to a specific social group. Say, 1 for the "comedy" variable becomes 0.8 for the "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is ok (see the sketch after this list).
Use weights. Not weights for categories or profiles, but weights for distinct attributes. But don't set these weights yourself; instead, run linear regression to find them.
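For the second point, a minimal sketch of such a translation, reusing the made-up genre-to-group numbers above:
import numpy as np

genres = ["comedy", "action", "thriller"]
groups = ["children", "housewife", "old_people"]
# Hypothetical translation matrix: row = genre, column = closeness to a social group.
T = np.array([
    [0.8, 0.6, 0.9],   # comedy (numbers from the example above)
    [0.2, 0.1, 0.3],   # action (made-up numbers)
    [0.3, 0.2, 0.1],   # thriller (made-up numbers)
])

genre_vector = np.array([1.0, 0.0, 1.0])   # user selected comedy and thriller
group_vector = genre_vector @ T            # same user in the "social group" coordinates
print(dict(zip(groups, group_vector)))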
Let me describe the last point in a bit more detail. Instead of simple cosine similarity, which looks like this:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]
The standard way to find such weights is to use some kind of regression (linear regression being the most popular). Normally, you collect a dataset (X, y) where X is a matrix with your data vectors as rows (e.g. details of houses being sold) and y is some kind of "correct answer" (e.g. the actual price each house was sold for). However, in your case there's no correct answer for the user vectors themselves; in fact, you can define a correct answer only for their similarity. So why not? Just make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
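A minimal sketch of that idea with scikit-learn, using made-up user vectors and hand-assigned similarity labels:
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical 5-attribute binary vectors for three users.
u1 = np.array([1, 0, 0, 1, 0], dtype=float)
u2 = np.array([1, 0, 1, 0, 1], dtype=float)
u3 = np.array([0, 1, 1, 0, 0], dtype=float)

# Training pairs with hand-assigned similarity labels.
pairs = [(u1, u2, 0.75), (u1, u3, 0.10), (u2, u3, 0.40)]

X = np.array([a * b for a, b, _ in pairs])   # element-wise products, one row per pair
y = np.array([s for _, _, s in pairs])

reg = LinearRegression(fit_intercept=False).fit(X, y)
w = reg.coef_                                # learned per-attribute weights

def weighted_similarity(a, b):
    return float(np.dot(w, a * b))

print(weighted_similarity(u1, u2))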
HTH

Corresp biplot in R - column scores plotted as 0

I am quite new to R. I am trying to do a corresp analysis (MASS package) on summarized data. While the output shows row and column scores, the resulting biplot shows the column scores as zero, making the plot unreadable (all values are arranged by row scores in the expected manner, but flat along the column scores).
The code is:
library(MASS)
corresp(some_data)
biplot(corresp(some_data, nf = 2))
I would be grateful for any suggestions as to what I'm doing wrong and how to amend this, thanks in advance!
Martin
As suggested here:
http://www.statsoft.com/textbook/correspondence-analysis
the biplot actually depicts distributions of the row/column variables over 2 extracted dimensions where the variables' dependency is "the sharpest".
It looks like in your case a good deal of the dependency is concentrated along just one dimension, while the second dimension is much less significant.
It does not seem, however, that your relationships are weak. On the contrary, looking at your graph, one can observe the red (column) variable's intersection with 2 distinct regions of the other variable's values.
Makes sense?
Regards,
Igor
