Loading a tokenizer on huggingface: AttributeError: 'AlbertTokenizer' object has no attribute 'vocab' - huggingface-transformers

I'm trying to load a huggingface model and tokenizer. This normally works really easily (I've done it with a dozen models):
from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = BertForMaskedLM.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
But for some reason I'm getting an error when I'm trying to load this one:
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=False)
model = AlbertForMaskedLM.from_pretrained("sultan/BioM-ALBERT-xxlarge")
tokenizer.vocab
I found a related question, but that seemed to be an issue in the git repo itself and not on huggingface. I checked the actual repo where this model is hosted on huggingface and it clearly has a vocab file (PubMD-30k-clean.vocab), like the rest of the models I loaded.

There seems to be an issue with the slow tokenizer. If you remove the use_fast parameter or set it to True, the fast tokenizer is loaded instead and you will be able to display the vocabulary.
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=True)
model = AlbertForMaskedLM.from_pretrained("sultan/BioM-ALBERT-xxlarge")
tokenizer.vocab
Output:
{'intervention': 7062,
'▁tongue': 6911,
'▁kit': 8341,
'▁biosimilar': 26423,
'bank': 19880,
'▁diesel': 20349,
'SOD': 6245,
'iri': 17739,
....
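One way to avoid depending on the attribute at all is to call get_vocab(), which both slow and fast tokenizers in transformers provide. The following is only a minimal sketch: the helper name vocab_of and the stub class are hypothetical, and the stub exists only so the snippet runs without downloading a model.

```python
def vocab_of(tokenizer):
    """Return a token -> id dict for either a slow or a fast tokenizer."""
    get_vocab = getattr(tokenizer, "get_vocab", None)
    if callable(get_vocab):
        # get_vocab() is part of the common tokenizer interface in transformers.
        return dict(get_vocab())
    # Fallback for objects that only expose a .vocab mapping.
    return dict(tokenizer.vocab)


class SlowTokenizerStub:
    """Hypothetical stand-in for a slow tokenizer: has get_vocab(), no .vocab."""

    def get_vocab(self):
        return {"▁tongue": 6911, "▁kit": 8341}


stub = SlowTokenizerStub()
print(vocab_of(stub)["▁kit"])
```

With the real model, the same helper would be called as vocab_of(AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=False)).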

Related

How to use Huggingface pretrained models to get the output of the dataset that was used to train the model?

I am working on getting the abstractive summaries of the XSUM and the CNN DailyMail datasets using Huggingface's pre-trained BART, Pegasus, and T5 models.
I am confused because there already exist checkpoints of models pre-trained on the same dataset.
So even if I do:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("mwesner/pretrained-bart-CNN-Dailymail-summ")
model = AutoModelForSeq2SeqLM.from_pretrained("mwesner/pretrained-bart-CNN-Dailymail-summ")
I can't understand how to get the summaries of either dataset since I don't have any new sentences that I can feed in.
This is how a pretrained model is normally used:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
But I need the summaries generated by the pre-trained model on the dataset that was used to train them (XSUM and CNN DailyMail).
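One possible approach, sketched here under assumptions, is to load the dataset's own test split and feed its articles to the model in batches. The helper names (batches, summarize_all, load_cnn_dailymail_articles, make_bart_summarizer) are hypothetical; the dataset identifier "cnn_dailymail" with config "3.0.0", its "article" column, and the generation max_length of 128 are assumptions that should be checked against the dataset card and the checkpoint being used.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def summarize_all(articles, summarize_batch, batch_size=8):
    """Apply `summarize_batch` (list of str -> list of str) to every article."""
    summaries = []
    for chunk in batches(articles, batch_size):
        summaries.extend(summarize_batch(chunk))
    return summaries


def load_cnn_dailymail_articles(limit=16):
    """Load the first `limit` test articles (assumed dataset id and column)."""
    from datasets import load_dataset  # imported lazily: optional dependency

    rows = load_dataset("cnn_dailymail", "3.0.0", split=f"test[:{limit}]")
    return list(rows["article"])


def make_bart_summarizer(name="facebook/bart-large-cnn"):
    """Build a batch summarizer around the BART checkpoint from the question."""
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained(name)
    model = BartForConditionalGeneration.from_pretrained(name)

    def summarize_batch(texts):
        inputs = tokenizer(texts, max_length=1024, truncation=True,
                           padding=True, return_tensors="pt")
        ids = model.generate(inputs["input_ids"],
                             attention_mask=inputs["attention_mask"],
                             num_beams=4, max_length=128, early_stopping=True)
        return tokenizer.batch_decode(ids, skip_special_tokens=True)

    return summarize_batch
```

Usage would then look like summarize_all(load_cnn_dailymail_articles(), make_bart_summarizer()). The same loop applies to XSUM once the dataset id and text column are swapped.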

How to add a calculated field to a django query expression

I have a Django model, DocumentComments, with two datetime fields, created and updated. I am working on a search function that parses a search string and returns a Q expression to query the DocumentComments model based on the values in the search string.
I need to write something like Q(created.year=xxxx), where created.year is the year in the created datetime field. But "keywords can't be expressions" as Django has been telling me all morning.
I tried using a custom model manager and annotating the default queryset with a year field, but that did not work as I can't seem to access the created.year value in the get_queryset function.
class DocumentCommentManager(models.Manager):
    def get_queryset(self):
        c_year = self.created.year
        u_year = self.updated.year
        return super(DocumentCommentManager, self).get_queryset().annotate(created_year=c_year, updated_year=u_year)
What am I missing, or what is a better way to accomplish my goal?
Thanks!
Mark
I was able to solve my problem using Django's db function Extract (https://docs.djangoproject.com/en/3.1/ref/models/database-functions/#extract)
My DocumentCommentManager:
from django.db.models.functions import Extract
class DocumentCommentManager(models.Manager):
    def get_queryset(self):
        return super(DocumentCommentManager, self).get_queryset().annotate(created_year=Extract("created", "year"))
This solves my original problem of adding a calculated datetime field to the model queries.
I still have not found a general way to add a calculated field to a model query using Q expressions. If you can share any examples, that would be great!
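For the specific case of a year, Django's built-in year lookup on date and datetime fields can be used directly inside a Q object, written as a double-underscore keyword rather than an attribute access. A minimal sketch, where the helper names year_filter_kwargs and year_q are hypothetical:

```python
def year_filter_kwargs(field_name, year):
    """Build the keyword for Django's `<field>__year` lookup."""
    return {f"{field_name}__year": int(year)}


def year_q(field_name, year):
    """Wrap the lookup in a Q object so it can be combined with | and &."""
    from django.db.models import Q  # imported lazily: requires Django

    return Q(**year_filter_kwargs(field_name, year))


print(year_filter_kwargs("created", "2020"))
```

In a search function this could be combined as DocumentComments.objects.filter(year_q("created", 2020) | year_q("updated", 2020)). A field annotated on the queryset, such as created_year above, can be referenced in a Q object by its annotation name in the same way.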

sql_constraints on existing field from original module

In product.template there is a field default_code. Is it possible to add an sql_constraint so that default_code must be unique? This code doesn't work. Or do I need to override the default_code field in my custom module?
class ProductProduct(models.Model):
    _inherit = 'product.template'
    _sql_constraints = [
        ('code_uniq', 'unique (default_code)', "Default code already exists!"),
    ]
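Please try a Python constraint; it may be useful for you.
Import these lines in your Python file:
from openerp import api, _
from openerp.exceptions import ValidationError
And write this method in your class:
@api.constrains('default_code')
def _check_default_code(self):
    # Check each record separately; the constraint may run on several records.
    for record in self:
        if not record.default_code:
            continue  # an empty code is not a duplicate
        code = self.search([('default_code', '=', record.default_code)])
        if len(code) > 1:
            raise ValidationError(_("Duplicate Record"))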
I would add the constraint on the model product.product, because that is where this information (the product reference) is really used. A constraint on default_code in product.template only works from Odoo V10 onward. In Odoo V8 and V9 it was an unstored related field, so it is not in the DB. In those versions you have to add the constraint on the model product.product.
class ProductProduct(models.Model):
    _inherit = 'product.product'
    _sql_constraints = [
        ('code_uniq', 'unique(default_code)', "Default code already exists!"),
    ]
Important to know: if the module that sets up the constraint is updated while the existing data violates it (e.g. the same default_code is in the DB twice), the SQL constraint won't be created in the DB. So you have to clean up the data and update the module again, or create the constraint in the DB yourself.
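Before updating the module, it can help to list which codes are actually duplicated. The following is a minimal sketch in plain Python; the helper name duplicate_codes is hypothetical, and it assumes empty values (False, None, or "") should be ignored because they do not identify a product.

```python
from collections import Counter


def duplicate_codes(codes):
    """Return the sorted list of non-empty codes that occur more than once."""
    counts = Counter(code for code in codes if code)
    return sorted(code for code, n in counts.items() if n > 1)


print(duplicate_codes(["A1", "B2", "A1", False, None, "C3"]))
```

In an Odoo shell, the input could be gathered with something like [p.default_code for p in self.env['product.product'].search([])], which is an assumption about your environment rather than part of the answer above.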

Save trained model of Spark's Naive Bayes classifier

Does anybody know whether it is possible to save a trained model of Spark's Naive Bayes classifier (for example, in a text file) and load it later if required?
Thank You.
I tried saving and loading the model. I was not able to recreate the model from the stored weights (I couldn't find the proper constructor). But the whole model is serializable, so you can store and load it as follows.
Store it with:
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
val fos = new FileOutputStream(<storage path>)
val oos = new ObjectOutputStream(fos)
oos.writeObject(model)
oos.close()
And load it with:
val fis = new FileInputStream(<storage path>)
val ois = new ObjectInputStream(fis)
// Cast to the class of the model that was stored (here, Naive Bayes).
val newModel = ois.readObject().asInstanceOf[org.apache.spark.mllib.classification.NaiveBayesModel]
ois.close()
It worked for me.
It is discussed in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-td11953.html
You can use the built-in functions (Spark version 2.1.0). Use NaiveBayesModel#save to store the model and NaiveBayesModel#load to read a previously stored model.
The save method comes from Saveable and is implemented by a wide range of classification models. The load method appears to be static in each classification model implementation.

Update all i18n fields from an action

What I'm trying to do is relatively simple but I can't find documentation.
Let's say I have a model Thing with a field label. The label field is internationalized.
How can I update all label fields from a model or an action?
(I'm using Doctrine)
You didn't say which ORM you're using so I assumed Doctrine.
You can update/set internationalized fields in the following way:
$thing = new Thing();
$thing->Translation['en']->label = 'My Label';
$thing->Translation['nl']->label = 'Mijn Label';
$thing->save();
Of course if your object is already persisted you have to retrieve it first.
Read more in symfony and doctrine docs:
http://www.symfony-project.org/jobeet/1_4/Doctrine/en/19#chapter_19_sub_doctrine_objects
http://www.doctrine-project.org/projects/orm/1.2/docs/manual/behaviors/en#core-behaviors:i18n
