Huggingface saving tokenizer - huggingface-transformers

I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't need access to the internet.
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_vocabulary("./models/tokenizer/")
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
However, the last line is giving the error:
OSError: Can't load config for './models/tokenizer3/'. Make sure that:
- './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'
- or './models/tokenizer3/' is the correct path to a directory containing a config.json file
transformers version: 3.1.0
The question "How to load the saved tokenizer from pretrained model in Pytorch" didn't help, unfortunately.
Edit 1
Thanks to @ashwin's answer below I tried save_pretrained instead, and I get the following error:
OSError: Can't load config for './models/tokenizer/'. Make sure that:
- './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
- or './models/tokenizer/' is the correct path to a directory containing a config.json file
The contents of the tokenizer folder were listed here (listing not preserved).
I tried renaming tokenizer_config.json to config.json and then I got the error:
ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder

save_vocabulary() saves only the tokenizer's vocabulary file (the list of subword tokens).
To save the entire tokenizer, use save_pretrained() instead,
as follows:
from transformers import AutoTokenizer, DistilBertTokenizer
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")
Edit:
For a reason I have not identified, loading with
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")
works, whereas
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
does not.

Renaming the "tokenizer_config.json" file (the one created by the save_pretrained() function) to "config.json" solved the same issue in my environment.

You need to save both your model and tokenizer in the same directory. HuggingFace is actually looking for the config.json file of your model, so renaming tokenizer_config.json would not solve the issue.
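A minimal sketch of that idea, assuming the tokenizer and the model config are both written into one folder so that config.json sits next to the tokenizer files. The helper name save_for_offline is hypothetical; it only relies on both objects exposing save_pretrained(), which tokenizers and configs in transformers do. Whether AutoTokenizer strictly needs config.json may depend on the transformers version, so treat this as something to try rather than a guaranteed fix.

```python
from pathlib import Path


def save_for_offline(tokenizer, config, out_dir):
    """Write tokenizer files and the model's config.json into one directory.

    Hypothetical helper: `tokenizer` and `config` are any objects that
    provide a save_pretrained(directory) method.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    tokenizer.save_pretrained(str(out))  # vocabulary + tokenizer_config.json
    config.save_pretrained(str(out))     # config.json for the model
    # Return the file names so the result can be inspected.
    return sorted(p.name for p in out.iterdir())


# Intended use (requires the transformers package and, the first time,
# network access to download the files):
#   from transformers import AutoConfig, AutoTokenizer
#   name = "distilbert-base-multilingual-cased"
#   save_for_offline(AutoTokenizer.from_pretrained(name),
#                    AutoConfig.from_pretrained(name),
#                    "./models/tokenizer/")
#   tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
```

Saving only the config (rather than the full model weights) keeps the folder small when the container needs nothing but the tokenizer.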

Related

Issue in loading pre-trained models from HuggingFace

Based on this, I wrote the following code in a Python file, which I want to run on my server:
model_name = 'xlm-roberta-base'
access_token = "........................"
self.tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, use_auth_token=access_token)
self.model = transformers.AutoModel.from_pretrained(model_name, use_auth_token=access_token)
I provided my access_token correctly, and I checked with both read and write roles. I always get the following error:
File "/home/.../anaconda3/envs/test/lib/python3.8/site-packages/transformers/utils/hub.py", line 424, in cached_file
raise EnvironmentError(
OSError: 5450 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
I tried every step I could think of but still got the error. Where do I need to modify the code?

Conversion of TFGPT2LMHeadModel to pb file

I have trained a customized GPT2 model using TFGPT2LMHeadModel and it works fine.
However, I have to put it in production, and the Keras h5 file performs poorly.
Consequently, the best option is to convert the h5 file to a pb file.
TensorFlow already has such a feature:
from transformers import TFGPT2LMHeadModel
import tensorflow as tf
initial_model = TFGPT2LMHeadModel.from_pretrained(model_link)
tf.saved_model.save(initial_model, 'model_pb_folder')
Which works fine.
I get a pb file as expected.
However, I can't load it using the same TFGPT2 function:
pb_model = TFGPT2LMHeadModel.from_pretrained(model_pb_folder)
It has the following error:
OSError: Error no file named tf_model.h5 found in directory
C:/.../model_pb_folder but there is a file for PyTorch weights. Use
from_pt=True to load this model from those weights.
Is there a solution to use a TFGPT2 model with a pb file instead of h5 file?

Load a pre-trained model from disk with Huggingface Transformers

From the documentation for from_pretrained, I understand I don't have to download the pretrained vectors every time, I can save them and load from disk with this syntax:
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
So, I went to the model hub:
https://huggingface.co/models
I found the model I wanted:
https://huggingface.co/bert-base-cased
I downloaded it from the link they provided to this repository:
Pretrained model on English language using a masked language modeling
(MLM) objective. It was introduced in this paper and first released in
this repository. This model is case-sensitive: it makes a difference
between english and English.
Stored it in:
/my/local/models/cased_L-12_H-768_A-12/
Which contains:
./
../
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
So, now I have the following:
PATH = '/my/local/models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
And I get this error:
> raise EnvironmentError(msg)
E OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E
E - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E
E - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file
Similarly for when I link to the config.json directly:
PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
if state_dict is None and not from_tf:
try:
state_dict = torch.load(resolved_archive_file, map_location="cpu")
except Exception:
raise OSError(
> "Unable to load weights from pytorch checkpoint file. "
"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
)
E OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
What should I do differently to get huggingface to use my local pretrained model?
Update to address the comments
YOURPATH = '/somewhere/on/disk/'
name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)
>>> Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')
So all is saved, but then....
YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)
"Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.

Where is the file located relative to your model folder? I believe it has to be a relative PATH rather than an absolute one. So if the file containing your code is located in 'my/local/', then your code should look like this:
PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
You just need to specify the folder that contains all the files, not the files themselves. I think this is most likely a problem with the PATH. Try changing the style of the slashes ("/" vs "\"), since they differ between operating systems. Also try a leading ".", as in ./models/cased_L-12_H-768_A-12/.
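Before experimenting with slashes, it may help to check what is actually in the folder. The error in the question complains about a missing config.json, and the folder listed there contains bert_config.json instead. The sketch below uses only the standard library; the helper name missing_files and its default list of required file names are assumptions for illustration, not part of transformers.

```python
from pathlib import Path


def missing_files(model_dir, required=("config.json", "vocab.txt")):
    """Return the names from `required` that are not present in model_dir.

    Hypothetical helper; the default file names are an assumption based on
    the error message, which asks for a directory containing config.json.
    """
    # pathlib normalises the slash style for the current operating system.
    root = Path(model_dir).expanduser().resolve()
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return [name for name in required if not (root / name).is_file()]
```

Calling missing_files on the folder from the question would be expected to report config.json as missing, which points at the file names rather than the path style.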

I had this same need and just got this working with Tensorflow on my Linux box so figured I'd share.
My requirements.txt file for my code environment:
tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3
I'm using Python 3.6.
I went to this site here which shows the directory tree for the specific huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for your cased version. Also note that my link is to a very specific commit of this model, just for the sake of reproducibility - there will very likely be a more up-to-date version by the time someone reads this.
I manually downloaded (or had to copy/paste into notepad++ because the download button took me to a raw version of the txt / json in some cases... odd...) the following files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
NOTE: Once again, all I'm using is Tensorflow, so I didn't download the Pytorch weights. If you're using Pytorch, you'll likely want to download those weights instead of the tf_model.h5 file.
I then put those files in this directory on my Linux box:
/opt/word_embeddings/bert-base-uncased/
Probably a good idea to make sure there's at least read permissions on all of these files as well with a quick ls -la (my permissions on each file are -rw-r--r--). I also have execute permissions on the parent directory (the one listed above) so people can cd to this dir.
From there, I'm able to load the model like so:
tokenizer:
# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
layer/model weights:
# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")

This should be quite easy on Windows 10 using a relative path. Assuming your pre-trained (PyTorch-based) transformer model is in a 'model' folder in your current working directory, the following code can load your model.
from transformers import AutoModel
# raw string so the backslash is not treated as an escape character
model = AutoModel.from_pretrained(r'.\model', local_files_only=True)
Please note the dot in '.\model'. Omitting it will cause the load to fail.

In addition to the config file and vocab file, you need to add the TF/PyTorch model weights (with an .h5 or .bin extension) to your directory.
In your case, the PyTorch and TF models may be located at these URLs:
torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin
tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5
You can also find all the required files in the "Files and versions" section of your model: https://huggingface.co/bert-base-cased/tree/main

The BERT model folder contains these files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
If instead we only have these files:
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
then how do we load the model?

Here is a short answer.
from transformers import BertForMaskedLM, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/pytorch_model.bin', config='../config.json', local_files_only=True)
Usually config.json need not be supplied explicitly if it resides in the same directory.

You can use the simpletransformers library. Check out the link for a more detailed explanation.
from simpletransformers.classification import ClassificationModel
model = ClassificationModel("bert", "dir/your_path")
Here I used ClassificationModel as an example. You can use it for many other tasks as well, such as question answering.

Neo.ClientError.Statement.ExternalResourceFailed on Mac

I have a CSV file I generate through code. I want to import the generated CSV file into neo4j using the following cypher query.
LOAD CSV WITH HEADERS FROM 'file:////Users/{user}/Desktop/neo4j-importer/tmp/temp_data.csv'
I have changed the following config variables:
Commented out dbms.directories.import=import.
Set dbms.security.allow_csv_import_from_file_urls=true.
The problem is that I get the following error:
Neo.ClientError.Statement.ExternalResourceFailed:
Couldn't load the external resource at:
file:/Users/{user}/Library/Application%20Support/Neo4j%20Desktop/Application/neo4jDatabases/database-c517b267-220d-4b7a-be26-813d5b64a51a/installation-3.5.3/import/Users/{user}/Desktop/neo4j-importer/tmp/temp_data.csv
I mean it is partly right just not the /Users/{user}/Library/Application%20Support/Neo4j%20Desktop/Application/neo4jDatabases/database-c517b267-220d-4b7a-be26-813d5b64a51a/installation-3.5.3/import/ bit... Any suggestions on how to fix this weird file pathing problem?

Try changing the config setting to point to the directory with your imports:
dbms.directories.import=/Users/{user}/Desktop/neo4j-importer/tmp
and then changing the Cypher query to just specify the CSV file:
LOAD CSV WITH HEADERS FROM 'file:///temp_data.csv'
...
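Since the CSV is generated through code, the URL for the query can be derived the same way. This is a sketch under the assumption that dbms.directories.import points at the folder holding the CSV, as suggested above. The helper name load_csv_url, the user name "me", and the trailing "AS row RETURN row LIMIT 5" clause are illustrative and not taken from the original query.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def load_csv_url(csv_path, import_dir):
    """Return a file:/// URL for csv_path, relative to the import directory."""
    # relative_to raises ValueError if the CSV is not inside import_dir.
    relative = PurePosixPath(csv_path).relative_to(PurePosixPath(import_dir))
    # Percent-encode spaces and similar characters; "/" is left untouched.
    return "file:///" + quote(relative.as_posix())


url = load_csv_url(
    "/Users/me/Desktop/neo4j-importer/tmp/temp_data.csv",  # hypothetical user "me"
    "/Users/me/Desktop/neo4j-importer/tmp",
)
# Illustrative query; only the LOAD CSV ... FROM part comes from the question.
query = f"LOAD CSV WITH HEADERS FROM '{url}' AS row RETURN row LIMIT 5"
```

The ValueError for files outside the import directory is deliberate: it surfaces the mismatch in the generating code instead of in a Neo4j error.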

Pocketsphinx install does not contain acoustic model definition mdef

I have tried to install pocketsphinx 5 prealpha on Windows, but it gets stuck on the error below.
INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(143): mean[0]= 12.00, mean[1..12]= 0.0
ERROR: "acmod.c", line 83: Folder 'model/en-us/en-us' does not contain acoustic model definition 'mdef'
My sphinxbase and pocketsphinx folders are in the same parent folder, and I have renamed them as the instructions say.
how I compile it
I have checked all the directories, and the folder does contain an mdef file without an extension.
What should I do?
Thank you.

You need to specify a proper path to the model folder. You are currently in the bin\Release\x64 folder. In your case the path to the model folder must be ..\..\..\model\en-us\en-us. If you are not sure what the relative path is, specify an absolute path.
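One way to remove the guesswork is to resolve the relative path against the folder the program runs from and confirm that 'mdef' is really there. The sketch below uses only the Python standard library; the helper name resolve_model_dir is hypothetical, and forward slashes are used in the example so it behaves the same on any operating system.

```python
from pathlib import Path


def resolve_model_dir(model_dir, start=None):
    """Resolve model_dir against `start` (default: the current directory)
    and confirm it contains the acoustic model definition file 'mdef'."""
    base = Path(start) if start is not None else Path.cwd()
    # An absolute model_dir replaces `base`; a relative one is joined to it.
    resolved = (base / model_dir).resolve()
    if not (resolved / "mdef").is_file():
        raise FileNotFoundError(f"no 'mdef' file in {resolved}")
    return resolved


# Example call, assuming the program runs from bin/Release/x64:
#   resolve_model_dir("../../../model/en-us/en-us")
```

The returned absolute path can then be passed to pocketsphinx in place of the relative one.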

This happens because the example code sets the MODELDIR and DATADIR variables to default locations, but you need to set them to match your own file locations. Changing the following might sort out the issue:
MODELDIR = "/usr/local/share/pocketsphinx/model/"
DATADIR = "/my/Desktop/directory/pocketsphinx-master/test/data/"
This should work, though I'm not sure. Do you have a better solution?
