How to get the SHAP values of HuggingFace VisualBERT transformer?

I am trying to extract the SHAP values of VisualBERT for VQA tasks. The official SHAP documentation only has examples for text classification, but I don't know how to extract SHAP values for VisualBERT. Can anyone help?
I tried the pipeline method to get SHAP values, like this:
from transformers import BertTokenizerFast, VisualBertForQuestionAnswering, pipeline

bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
visualbert_vqa = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")
pipe = pipeline("visual-question-answering", model=visualbert_vqa, tokenizer=bert_tokenizer)
But there was an error:
"The model 'VisualBertForQuestionAnswering' is not supported for visual-question-answering. Supported models are ['ViltForQuestionAnswering']."
Is there any other way to get the SHAP values of VisualBERT on any task?
Thanks in advance.

Related

LightGBM 'Using categorical_feature in Dataset.' Warning?

From my reading of the LightGBM documentation, one is supposed to define categorical features in the Dataset method. So I have the following code:
cats=['C1', 'C2']
d_train = lgb.Dataset(X, label=y, categorical_feature=cats)
However, I received the following error message:
/app/anaconda3/anaconda3/lib/python3.7/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
warnings.warn('Using categorical_feature in Dataset.')
Why did I get the warning message?
I presume that you get this warning in a call to lgb.train. This function also has the argument categorical_feature, whose default value is 'auto', which means taking categorical columns from a pandas.DataFrame (documentation). The warning, which is emitted in lightgbm/basic.py, indicates that, although lgb.train requested that categorical features be identified automatically, LightGBM will use the features specified in the dataset instead.
To avoid the warning, you can give the same argument categorical_feature to both lgb.Dataset and lgb.train. Alternatively, you can construct the dataset with categorical_feature=None and only specify the categorical features in lgb.train.
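For the second option, a minimal sketch might look like this (assuming X, y, and params are already defined, and reusing the C1/C2 columns from the question):
import lightgbm as lgb

# Leave the Dataset silent about categorical columns...
d_train = lgb.Dataset(X, label=y, categorical_feature=None)

# ...and name them only at training time, so the two no longer disagree.
booster = lgb.train(params, d_train, categorical_feature=['C1', 'C2'])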
As user andrey-popov described, you can use lgb.train's categorical_feature parameter to get rid of this warning.
Below is a simple example of how you could do it:
# Define categorical features
cat_feats = ['item_id', 'dept_id', 'store_id',
             'cat_id', 'state_id', 'event_name_1',
             'event_type_1', 'event_name_2', 'event_type_2']
...
# Define the datasets with the categorical_feature parameter
train_data = lgb.Dataset(X.loc[train_idx],
                         Y.loc[train_idx],
                         categorical_feature=cat_feats,
                         free_raw_data=False)
valid_data = lgb.Dataset(X.loc[valid_idx],
                         Y.loc[valid_idx],
                         categorical_feature=cat_feats,
                         free_raw_data=False)
# And train using the categorical_feature parameter
lgb.train(lgb_params,
          train_data,
          valid_sets=[valid_data],
          verbose_eval=20,
          categorical_feature=cat_feats,
          num_boost_round=1200)
This is less an answer to the original OP and more an answer for people who are using the sklearn API and encounter this issue.
For those of you who are using the sklearn API, especially one of the cross_val methods from sklearn, there are two solutions you could consider.
Sklearn API solution
A solution that worked for me was to cast categorical fields into the category datatype in pandas.
If you are using a pandas DataFrame, LightGBM should automatically treat those columns as categorical. From the documentation:
integer codes will be extracted from pandas categoricals in the Python-package
It would make sense for this to be the equivalent in the sklearn API to setting categoricals in the Dataset object.
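A minimal sketch of that cast (assuming a DataFrame df, a target y, and the cat_feats list from the example above):
import pandas as pd
from lightgbm import LGBMClassifier

# Cast the categorical columns to pandas' category dtype so the
# sklearn API picks them up automatically, with no extra kwargs.
for col in cat_feats:
    df[col] = df[col].astype('category')

clf = LGBMClassifier()
clf.fit(df, y)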
But keep in mind that LightGBM does not officially support virtually any of the non-core parameters for the sklearn API, and they say so explicitly:
**kwargs is not supported in sklearn, it may cause unexpected issues.
Adaptive Solution
The other, more sure-fire solution for being able to use methods like cross_val_predict is to create your own wrapper class that uses the core Dataset/train API under the hood but exposes a fit/predict interface for the cv methods to latch onto. That way you get the full functionality of LightGBM with only a little bit of rolling your own code.
The below sketches out what this could look like.
import lightgbm as lgb
from sklearn.base import BaseEstimator

class LGBMSKLWrapper(BaseEstimator):
    # Inheriting from BaseEstimator provides get_params/set_params,
    # so sklearn's cross-validation utilities can clone the wrapper.
    def __init__(self, categorical_variables, params):
        self.categorical_variables = categorical_variables
        self.params = params
        self.model = None

    def fit(self, X, y):
        my_dataset = lgb.Dataset(X, y, categorical_feature=self.categorical_variables)
        self.model = lgb.train(params=self.params, train_set=my_dataset)
        return self  # sklearn expects fit to return the estimator

    def predict(self, X):
        return self.model.predict(X)
The above lets you load up your parameters when you create the object and then passes them on to train when the client calls fit.
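A hypothetical usage with sklearn's cross-validation, assuming X, y, cat_feats, and lgb_params are defined:
from sklearn.model_selection import cross_val_predict

# Works because the wrapper exposes fit/predict and is cloneable
wrapper = LGBMSKLWrapper(categorical_variables=cat_feats, params=lgb_params)
preds = cross_val_predict(wrapper, X, y, cv=5)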

How to save fasttext model in binary and text formats?

The documentation is a bit unclear on how to save the fastText model to disk: how do you specify a path in the argument? I tried doing so and it failed with an error.
Example from the documentation:
>>> from gensim.test.utils import get_tmpfile
>>>
>>> fname = get_tmpfile("fasttext.model")
>>>
>>> model.save(fname)
>>> model = FastText.load(fname)
Furthermore, how can I save the model in text format, as can be done with word2vec models?
word2vecmodel.wv.save_word2vec_format("D:\w2vmodel.txt")
EDIT
After trying the suggestion to make a file first, I keep getting the same error as before when I run this code:
import os
from gensim.test.utils import get_tmpfile

savepath = os.path.abspath('D:\fasttextmodel.v3.bin')
fname = get_tmpfile(savepath)
fasttext_model.save(fname)
TypeError: file must have a 'write' attribute
The FastText save()/load() example in the documentation is misleading; it suggests you use get_tmpfile. I am able to save the model if I pass the file name as a plain string and do not wrap it in get_tmpfile:
model.save("fasttext.model")
Then you can load it the same way, passing the string directly:
model = FastText.load("fasttext.model")
Note that this will save multiple files for models that are large. However, when you load the model, you only need to specify the main fasttext.model file; the function will automatically load additional files, if there are any.
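As for saving in text format: in gensim, the trained model's word vectors can be written out in the plain-text word2vec format via the wv attribute, much as in the question's word2vec example. A minimal sketch (the output file name is just an example):
# Write only the word vectors as plain text (one vector per line)
model.wv.save_word2vec_format("fasttext_vectors.txt", binary=False)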
Did you try creating a file in your local directory called "fasttext.model" before trying to save it?
Also, I'm assuming you trained the model before this, correct?

pyLDAvis with Mallet LDA implementation : LdaMallet object has no attribute 'inference'

Is it possible to plot a pyLDAvis visualization with a Mallet implementation of LDA? I have no trouble with a gensim LdaModel, but when I use Mallet I get:
'LdaMallet' object has no attribute 'inference'
My code:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(mallet_model, corpus, id2word)
vis
Run the line below to convert your Mallet model into an LdaModel before calling pyLDAvis.
[Edit]: edited the code to use the built-in function in gensim instead. I just tried it but am unable to get meaningful results with pyLDAvis on a converted Mallet model; the topics seem to contain random terms. Has anybody encountered this before?
import gensim
model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(mallet_model)
I got this from the link below; do explore it (lines 565-590):
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/wrappers/ldamallet.py#L359
I hope I have helped.
from gensim.models.ldamodel import LdaModel

def convertldaGenToldaMallet(mallet_model):
    # Build a gensim LdaModel with the Mallet model's vocabulary and hyperparameters
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha, eta=0,
    )
    # Copy the word-topic counts over and refresh the cached state
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim
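With either conversion, the resulting model can then be handed to pyLDAvis in place of the Mallet model, e.g.:
import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
# Convert the Mallet wrapper, then visualize the gensim-style model
converted = convertldaGenToldaMallet(mallet_model)
vis = pyLDAvis.gensim.prepare(converted, corpus, id2word)
vis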
I found this blog post helpful; it directly uses the statefile produced by MALLET, which is also produced when using gensim's Mallet wrapper.

How to export a sparkR model as PMML?

I'm trying to export a sparkR model as PMML.
The first approach was using the pmml library:
library(pmml)
sparkR.session()
data(iris)
df <- createDataFrame(iris)
model <- spark.kmeans(df, Sepal_Length ~ Sepal_Width, k = 4, initMode = "random")
model_pmml <- pmml(model)
The error:
Error in UseMethod("pmml"): no applicable method for 'pmml' applied to an object of class "KMeansModel"
Traceback:
1. pmml(model)
I also investigated whether the toPMML method available on Scala models could be used from SparkR. I found a question that suggests it may be possible with sparklyr, but not with SparkR.
Any ideas?
I have come to the conclusion that exporting a SparkR model is not supported. I have added a feature request for this: https://issues.apache.org/jira/browse/SPARK-21430. Please vote on the JIRA ticket if you are also looking for this functionality.

Convert multiple querysets to json in django

I asked a related question earlier today.
In this instance, I have 4 queryset results:
action_count = Action.objects.filter(complete=False, onhold=False).annotate(action_count=Count('name'))
hold_count = Action.objects.filter(onhold=True, hold_criteria__isnull=False).annotate(action_count=Count('name'))
visible_tags = Tag.objects.filter(visible=True).order_by('name').filter(action__complete=False).annotate(action_count=Count('action'))
hidden_tags = Tag.objects.filter(visible=False).order_by('name').filter(action__complete=False).annotate(action_count=Count('action'))
I'd like to return them to an AJAX function. I have to convert them to JSON, but I don't know how to include multiple querysets in the same JSON string.
I know this thread is old, but using simplejson to convert Django models doesn't work in many cases, such as decimals (as noted by rebus above).
As stated in the Django documentation, a serializer looks like the better choice.
Django’s serialization framework provides a mechanism for “translating” Django models into other formats. Usually these other formats will be text-based and used for sending Django data over a wire, but it’s possible for a serializer to handle any format (text-based or not).
Django Serialization Docs
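A minimal sketch of that approach, assuming the four querysets from the question are in scope inside a hypothetical view (note that serializers.serialize emits only model fields, so annotated values such as action_count are not included):
import json
from django.core import serializers
from django.http import HttpResponse

def action_summary(request):
    # Serialize each queryset separately, then combine into one payload
    payload = {
        'actions': json.loads(serializers.serialize('json', action_count)),
        'holds': json.loads(serializers.serialize('json', hold_count)),
        'visible_tags': json.loads(serializers.serialize('json', visible_tags)),
        'hidden_tags': json.loads(serializers.serialize('json', hidden_tags)),
    }
    return HttpResponse(json.dumps(payload), content_type='application/json')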
You can use Django's simplejson module (note that django.utils.simplejson was removed in later Django versions; the standard library json module works the same way). This code is untested though!
from django.utils import simplejson

data = {
    'action_count': list(Action.objects.filter(complete=False, onhold=False).annotate(action_count=Count('name')).values()),
    'hold_count': list(Action.objects.filter(onhold=True, hold_criteria__isnull=False).annotate(action_count=Count('name')).values()),
    ...
}
return HttpResponse(simplejson.dumps(data))
I'll test and rewrite the code as necessary when I have the time, but this should get you started.
