KeyError when using non-default models in Huggingface transformers pipeline - huggingface-transformers

I have no problems using the default model in the sentiment analysis pipeline.
# Allocate a pipeline for sentiment-analysis
nlp = pipeline('sentiment-analysis')
nlp('I am a black man.')
>>>[{'label': 'NEGATIVE', 'score': 0.5723695158958435}]
But, when I try to customise the pipeline a little by adding a specific model. It throws a KeyError.
nlp = pipeline('sentiment-analysis',
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational"),
model = AutoModelWithLMHead.from_pretrained("DeepPavlov/bert-base-cased-conversational"))
nlp('I am a black man.')
>>>---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-55-af7e46d6c6c9> in <module>
3 tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational"),
4 model = AutoModelWithLMHead.from_pretrained("DeepPavlov/bert-base-cased-conversational"))
----> 5 nlp('I am a black man.')
6
7
~/opt/anaconda3/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
721 outputs = super().__call__(*args, **kwargs)
722 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
--> 723 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
724
725
~/opt/anaconda3/lib/python3.7/site-packages/transformers/pipelines.py in <listcomp>(.0)
721 outputs = super().__call__(*args, **kwargs)
722 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
--> 723 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
724
725
KeyError: 58129

I am facing the same problem. I am working with a model from XML-R fine-tuned with squadv2 data set ("a-ware/xlmroberta-squadv2"). In my case, the KeyError is 16.
Link
Looking for help on the issue I have found this information: link I hope you find it helpful.
Answer (from the link)
The pipeline throws an exception when the model predicts a token that is not part of the document (e.g. final special token [SEP])
My problem:
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
from transformers import pipeline
nlp = pipeline('question-answering',
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2'),
tokenizer= XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2'))
nlp(question = "Who was Jim Henson?", context ="Jim Henson was a nice puppet")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-15-b5a8ece5e525> in <module>()
1 context = "Jim Henson was a nice puppet"
2 # --------------- CON INTERROGACIONES
----> 3 nlp(question = "Who was Jim Henson?", context =context)
1 frames
/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in <listcomp>(.0)
1745 ),
1746 }
-> 1747 for s, e, score in zip(starts, ends, scores)
1748 ]
1749
KeyError: 16
Solution 1: Adding punctuation at the end of the context
In order to avoid the bug of trying to extract the final token (which may be an special one as [SEP]) I added an element (in this case a punctuation mark) at the end of the context:
nlp(question = "Who was Jim Henson?", context ="Jim Henson was a nice puppet.")
[OUT]
{'answer': 'nice puppet.', 'end': 28, 'score': 0.5742837190628052, 'start': 17}
Solution 2: Do not use pipeline()
The original model can handle itself to retrieve the correct token`s index.
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
import torch
tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
answer = tokenizer.convert_tokens_to_ids(answer.split())
answer = tokenizer.decode(answer)
Update
Looking in more detail your case, I found that the default model for Conversational task in the pipeline is distilbert-base-cased (source code).
The first solution I posted is not a good solution indeed. Trying other questions I got the same error. However, the model itself outside the pipeline works fine (as I showed in solution 2). Thus, I believe that not all models can be introduced in the pipeline. If anyone has more information about it please help us out. Thanks.

Related

'TypeError: can't pickle _thread.lock objects' in hyperparameter search

This is same as this issue. Transformers version 4.19.4.
I have already seen this very similar question and the related issue, but these solutions don't work in my case.
This is the code I'm using:
args = TrainingArguments(
f"{model_name}-hyperp-{task}",
evaluation_strategy = "epoch",
learning_rate=2e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=3,
weight_decay=0.01,
skip_memory_metrics=True, # https://github.com/huggingface/transformers/issues/12177 [picking error]
)
trainer = Trainer(
model_init=model_init, # function to initialize model (using 'from_pretrained')
args=args,
train_dataset = tokenized_datasets["train"],
eval_dataset = tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
trainer.hyperparameter_search(
hp_space=lambda _: tune_config,
backend="ray",
n_trials=10,
resources_per_trial={"cpu": 1, "gpu": 0},
scheduler=scheduler,
keep_checkpoints_num=1,
checkpoint_score_attr="training_iteration",
progress_reporter=reporter,
local_dir="/ray_results/",
name="tune_transformer_pbt",
log_to_file=True,
)
The error is:
TypeError Traceback (most recent call last)
[<ipython-input-50-3716493001d6>](https://localhost:8080/#) in <module>()
33 local_dir="/ray_results/",
34 name="tune_transformer_pbt",
---> 35 log_to_file=True,
36 )
37
...lots of lines...
[/usr/local/lib/python3.7/dist-packages/ray/cloudpickle/cloudpickle_fast.py](https://localhost:8080/#) in dump(self, obj)
618 def dump(self, obj):
619 try:
--> 620 return Pickler.dump(self, obj)
621 except RuntimeError as e:
622 if "recursion" in e.args[0]:
TypeError: can't pickle _thread.lock objects
The complete error trace can be found at the 1st link above.
I have already tried the suggested way to set skip_memory_metrics=True in TrainingArguments, but this doesn't work.
I was using Google Colab.

How to test a model before fine-tuning in Pytorch Lightning?

Doing things on Google Colab.
transformers: 4.10.2
pytorch-lightning: 1.2.7
import torch
from torch.utils.data import DataLoader
from transformers import BertJapaneseTokenizer, BertForSequenceClassification
import pytorch_lightning as pl
dataset_for_loader = [
{'data':torch.tensor([0,1]), 'labels':torch.tensor(0)},
{'data':torch.tensor([2,3]), 'labels':torch.tensor(1)},
{'data':torch.tensor([4,5]), 'labels':torch.tensor(2)},
{'data':torch.tensor([6,7]), 'labels':torch.tensor(3)},
]
loader = DataLoader(dataset_for_loader, batch_size=2)
for idx, batch in enumerate(loader):
print(f'# batch {idx}')
print(batch)
category_list = [
'dokujo-tsushin',
'it-life-hack',
'kaden-channel',
'livedoor-homme',
'movie-enter',
'peachy',
'smax',
'sports-watch',
'topic-news'
]
tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_NAME)
max_length = 128
dataset_for_loader = []
for label, category in enumerate(tqdm(category_list)):
# file ./text has lots of articles, categorized by category
# and they are just plain texts, whose content begins from forth line
for file in glob.glob(f'./text/{category}/{category}*'):
lines = open(file).read().splitlines()
text = '\n'.join(lines[3:])
encoding = tokenizer(
text,
max_length=max_length,
padding='max_length',
truncation=True
)
encoding['labels'] = label
encoding = { k: torch.tensor(v) for k, v in encoding.items() }
dataset_for_loader.append(encoding)
SEED=lambda:0.0
# random.shuffle(dataset_for_loader) # ランダムにシャッフル
random.shuffle(dataset_for_loader,SEED)
n = len(dataset_for_loader)
n_train = int(0.6*n)
n_val = int(0.2*n)
dataset_train = dataset_for_loader[:n_train]
dataset_val = dataset_for_loader[n_train:n_train+n_val]
dataset_test = dataset_for_loader[n_train+n_val:]
dataloader_train = DataLoader(
dataset_train, batch_size=32, shuffle=True
)
dataloader_val = DataLoader(dataset_val, batch_size=256)
dataloader_test = DataLoader(dataset_test, batch_size=256)
class BertForSequenceClassification_pl(pl.LightningModule):
def __init__(self, model_name, num_labels, lr):
super().__init__()
self.save_hyperparameters()
self.bert_sc = BertForSequenceClassification.from_pretrained(
model_name,
num_labels=num_labels
)
def training_step(self, batch, batch_idx):
output = self.bert_sc(**batch)
loss = output.loss
self.log('train_loss', loss)
return loss
def validation_step(self, batch, batch_idx):
output = self.bert_sc(**batch)
val_loss = output.loss
self.log('val_loss', val_loss)
def test_step(self, batch, batch_idx):
labels = batch.pop('labels')
output = self.bert_sc(**batch)
labels_predicted = output.logits.argmax(-1)
num_correct = ( labels_predicted == labels ).sum().item()
accuracy = num_correct/labels.size(0)
self.log('accuracy', accuracy)
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
checkpoint = pl.callbacks.ModelCheckpoint(
monitor='val_loss',
mode='min',
save_top_k=1,
save_weights_only=True,
dirpath='model/',
)
trainer = pl.Trainer(
gpus=1,
max_epochs=10,
callbacks = [checkpoint]
)
model = BertForSequenceClassification_pl(
MODEL_NAME, num_labels=9, lr=1e-5
)
### (a) ###
# I think this is where I am doing fine-tuning
trainer.fit(model, dataloader_train, dataloader_val)
# this is to score after fine-tuning
test = trainer.test(test_dataloaders=dataloader_test)
print(f'Accuracy: {test[0]["accuracy"]:.2f}')
But I am not really sure how to do a test before fine-tuning, in order to compare two models before and after fine-tuning, in order to show how effective fine-tuning is.
Inserting the following two lines to ### (a) ###:
test = trainer.test(test_dataloaders=dataloader_test)
print(f'Accuracy: {test[0]["accuracy"]:.2f}')
I got this result:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-c8b2c67f2d5c> in <module>()
9
10 # 6-19
---> 11 test = trainer.test(test_dataloaders=dataloader_test)
12 print(f'Accuracy: {test[0]["accuracy"]:.2f}')
13
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in test(self, model, test_dataloaders, ckpt_path, verbose, datamodule)
896 self.verbose_test = verbose
897
--> 898 self._set_running_stage(RunningStage.TESTING, model or self.lightning_module)
899
900 # If you supply a datamodule you can't supply train_dataloader or val_dataloaders
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _set_running_stage(self, stage, model_ref)
563 the trainer and the model
564 """
--> 565 model_ref.running_stage = stage
566 self._running_stage = stage
567
AttributeError: 'NoneType' object has no attribute 'running_stage'
I noticed that Trainer.fit() can take None as arguments other than model, so I tried this:
trainer.fit(model)
test=trainer.test(test_dataloaders=dataloader_test)
print(f'Accuracy: {test[0]["accuracy"]:.2f}')
The result:
MisconfigurationException: No `train_dataloader()` method defined. Lightning `Trainer` expects as minimum a `training_step()`, `train_dataloader()` and `configure_optimizers()` to be defined.
Thanks.
The Trainer needs to call its .fit() in order to set up a lot of things and then only you can do .test() or other methods.
You are right about putting a .fit() just before .test() but the fit call needs to a valid one. You have to feed a dataloader/datamodule to it. But since you don't want to do a training/validation in this fit call, just pass limit_[train/val]_batches=0 while Trainer construction.
trainer = Trainer(gpus=..., ..., limit_train_batches=0, limit_val_batches=0)
trainer.fit(model, dataloader_train, dataloader_val)
trainer.test(model, dataloader_test) # without fine-tuning
The fit call here will just set things up for you and skip training/validation. And then the testing follows. Next time run the same code but without the limit_[train/val]_batches, this will do the pretraining for you
trainer = Trainer(gpus=..., ...)
trainer.fit(model, dataloader_train, dataloader_val)
trainer.test(model, dataloader_test) # with fine-tuning
Clarifying a bit about .fit() taking None for all but model: Its not quite true - you must provide either a DataLoader or a DataModule.

How to use my own corpus on word embedding model BERT

I am trying to create a question-answering model with the word embedding model BERT from google. I am new to this and would really want to use my own corpus for the training. At first I used an example from the huggingface site and that worked fine:
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
'context': "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland.",
'question': "Wat is de hoofdstad van Nederland?"})
output
> {'answer': 'Amsterdam', 'end': 9, 'score': 0.825619101524353, 'start': 0}
So, I tried creating a .txt file to test if it was possible to interchange the sentence in the context parameter with the exact same sentence but in a .txt file.
with open('test.txt') as f:
lines = f.readlines()
qa_pipeline = pipeline(
"question-answering",
model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
'context': lines,
'question': "Wat is de hoofdstad van Nederland?"})
But this gave me the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-2bae0ecad43e> in <module>()
10 qa_pipeline({
11 'context': lines,
---> 12 'question': "Wat is de hoofdstad van Nederland?"})
5 frames
/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py in _is_whitespace(c)
84
85 def _is_whitespace(c):
---> 86 if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
87 return True
88 return False
TypeError: ord() expected a character, but string of length 66 found
I was just experimenting with ways to read and use a .txt file, but I don't seem to find a different solution. I did some research on the huggingface pipeline() function and this is what was written about the question and context parameters:
Got it! The solution was really easy. I assumed that the variable 'lines' was already a str but that wasn't the case. Just by casting to a string the question-answering model accepted my test.txt file.
so from:
with open('test.txt') as f:
lines = f.readlines()
to:
with open('test.txt') as f:
lines = str(f.readlines())

How to calculate shap values for ADABoost model?

I am running 3 different model (Random forest, Gradient Boosting, Ada Boost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF but not for ADA with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on Git that state
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is and add another if statement in the
TreeEnsemble constructor similar to the one for gradient boosting
But I really don't know how to implement it since I quite new to this.
I had the same problem and what I did, was to modify the file in the git you are commenting.
In my case I use windows so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers but you can do double click over the error message and the file will be opened.
The next step is to add another elif as the answer of the git help says. In my case I did it from the line 404 as following:
1) Modify the source code.
...
self.objective = objective_name_map.get(model.criterion, None)
self.tree_output = "probability"
elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"): #From this line I have modified the code
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line I added
elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"): # TODO: add unit test for this case
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note in the other models, the code of shap needs the attribute 'criterion' that the AdaBoost classifier doesn't have in a direct way. So in this case this attribute is obtained from the "weak" classifiers with the AdaBoost has been trained, that's why I add model.base_estimator_.criterion .
Finally you have to import the library again, train your model and get the shap values. I leave an example:
2) Import again the library and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates the following:
3) Get your new results:
It seems that the shap package has been updated and still does not contain the AdaBoostClassifier. Based on the previous answer, I've modified the previous answer to work with the shap/explainers/tree.py file in lines 598-610
### Added AdaBoostClassifier based on the outdated StackOverflow response and Github issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weighted_boosting.AdaBoostClassifier"]):
assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
self.input_dtype = np.float32
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line added
Also working on testing to add this to the package :)

How to get the highest predicted value in multiclass classification problem using H2O AI?

When predicting values in a multiclass classification problem, I would like to get the probability of the predicted value.
I tried to solve this by using H2O's apply function:
predicted_df = modelo_assessor.predict(to_predict_h2o_frame)
predicted_df.apply((lambda x: x.max()), axis=1)
But it does not work:
'ValueError: unimpl bytecode instr: CALL_METHOD'
Maybe it doesn't work because h2o.max does not have axis parameter as h2o.mean does???
I couldn't find the documentation of which operations are supported on apply function.
I would like to solve the problem using h2o data manipulation similarly to this pandas code:
predicted_df = modelo_assessor.predict(to_predict_h2o_frame).as_data_frame()
predicted_df['PROB_PREDICTED']=predicted_df.iloc[:,1:].max(axis=1)
This is happening whenever using apply. Use the example from H2O documentation:
I was able to solve the problem by downgrading to Python 3.6.x
http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
python_lists = [[1,2,3,4], [1,2,3,4]]
h2oframe = h2o.H2OFrame(python_obj=python_lists,
na_strings=['NA'])
colMean = h2oframe.apply(lambda x: x.mean(), axis=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-43-8da6b76c71bd> in <module>
2 h2oframe = h2o.H2OFrame(python_obj=python_lists,
3 na_strings=['NA'])
----> 4 colMean = h2oframe.apply(lambda x: x.mean(), axis=0)
~/anaconda3/envs/h2o1/lib/python3.7/site-packages/h2o/frame.py in apply(self, fun, axis)
4910 assert_is_type(fun, FunctionType)
4911 assert_satisfies(fun, fun.__name__ == "<lambda>")
-> 4912 res = lambda_to_expr(fun)
4913 return H2OFrame._expr(expr=ExprNode("apply", self, 1 + (axis == 0), *res))
4914
~/anaconda3/envs/h2o1/lib/python3.7/site-packages/h2o/astfun.py in lambda_to_expr(fun)
133 code = fun.__code__
134 lambda_dis = _disassemble_lambda(code)
--> 135 return _lambda_bytecode_to_ast(code, lambda_dis)
136
137 def _lambda_bytecode_to_ast(co, ops):
~/anaconda3/envs/h2o1/lib/python3.7/site-packages/h2o/astfun.py in _lambda_bytecode_to_ast(co, ops)
147 body, s = _opcode_read_arg(s, ops, keys)
148 else:
--> 149 raise ValueError("unimpl bytecode instr: " + instr)
150 if s > 0:
151 print("Dumping disassembled code: ")
ValueError: unimpl bytecode instr: CALL_METHOD

Resources