Conversion of TFGPT2LMHeadModel to pb file - performance

I have trained a customized GPT2 model using TFGPT2LMHeadModel and it works fine.
However, I have to put it in production, and the Keras h5 file performs poorly.
Consequently, the best option is to convert the h5 file to a pb file.
TensorFlow already has such a feature:
from transformers import TFGPT2LMHeadModel
import tensorflow as tf
initial_model = TFGPT2LMHeadModel.from_pretrained(model_link)
tf.saved_model.save(initial_model, 'model_pb_folder')
This works fine, and I get a pb file as expected.
However, I can't load it using the same TFGPT2 function:
pb_model = TFGPT2LMHeadModel.from_pretrained('model_pb_folder')
It fails with the following error:
OSError: Error no file named tf_model.h5 found in directory
C:/.../model_pb_folder but there is a file for PyTorch weights. Use
from_pt=True to load this model from those weights.
Is there a solution to use a TFGPT2 model with a pb file instead of h5 file?
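A possible direction, as an untested sketch: the folder written by tf.saved_model.save is a TensorFlow SavedModel rather than a transformers checkpoint, which is why from_pretrained looks for tf_model.h5 and fails. Such a folder can be loaded with tf.saved_model.load instead. The helper name load_saved_model is hypothetical, and the "serving_default" signature key is an assumption (it is TensorFlow's default name, but an export may define others).

```python
import os


def load_saved_model(export_dir):
    """Load a directory written by tf.saved_model.save (hypothetical helper)."""
    # A SavedModel directory contains a saved_model.pb file at its top level.
    pb_path = os.path.join(export_dir, "saved_model.pb")
    if not os.path.isfile(pb_path):
        raise FileNotFoundError(f"no saved_model.pb in {export_dir!r}")

    # Imported lazily so the path check above works without TensorFlow.
    import tensorflow as tf

    loaded = tf.saved_model.load(export_dir)
    # Assumption: the export exposes TensorFlow's default signature name.
    return loaded.signatures["serving_default"]
```

The returned object is a concrete function that is called with keyword tensors. Which argument names it accepts depends on how the model was exported; its structured_input_signature attribute lists them.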

Related

Loading pre-trained BERT model error - Error no file named ['pytorch_model.bin', 'tf_model.h5'] found

I think my problem here is due to a transformers version mismatch, but I would like some help with this.
Previously I used the huggingface library to perform language model fine-tuning. This takes a corpus and an existing BERT model, and fine-tunes that model on the corpus. My command was:
python run_language_modelling.py --output_dir=lm_finetune --model_type=bert --model_name_or_path=bert-base-uncased --do_train --train_data_file=thread0_wdc20.txt --do_eval --eval_data_file=wiki.test.raw --mlm --save_total_limit=1 --save_steps=2 --line_by_line --num_train_epochs=2
I fine-tuned the models successfully, and this created a folder that contained the following files:
checkpoint-183236 config.json eval_results_lm.txt lm_finetune pytorch_model.bin special_tokens_map.json tokenizer_config.json training_args.bin vocab.txt
And I also successfully loaded this fine-tuned language model for downstream tasks.
The problem is that I don't remember the versions of the libraries I used for all of this: pytorch, transformers, tensorflow...
Recently, I have been experimenting with something that required me to reinstall these libraries. Their versions are now:
tensorflow-gpu 2.2.0
transformers 3.0.2
pytorch 1.4.0
torchtext 0.5.0
And when I use this environment to reload those previously fine-tuned language models, I get this error:
File "/home/li1zz/.conda/envs/tensorflow-gpu/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/li1zz/.conda/envs/tensorflow-gpu/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/li1zz/wop_matching/src/exp/run_bert_standard.py", line 313, in <module>
bert_model = transformers.TFBertModel.from_pretrained(bert_model)
File "/home/li1zz/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/transformers/modeling_tf_utils.py", line 437, in from_pretrained
[WEIGHTS_NAME, TF2_WEIGHTS_NAME], pretrained_model_name_or_path
OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5'] found in directory /home/li1zz/bert/lm_finetune_proddesc/lm_finetune_part00 or `from_pt` set to False
Obviously, the file that is missing now is
tf_model.h5
I don't understand how I got this error; the fine-tuned models definitely worked before. The only explanation I can think of is a version mismatch, i.e., I fine-tuned those models using library versions that are incompatible with the ones I am using now, since one file is missing.
Can anyone provide some insight into this? Am I using the wrong versions of the libraries? How can I fix this without redoing all the language model fine-tuning in the new environment?
I have discovered that the solution is to add 'from_pt=True' in the line that loads the model:
bert_model = transformers.TFBertModel.from_pretrained(bert_model, from_pt=True)
This is actually hinted at in the error message. It appears to be a newly added feature, because I clearly remember that my previous code worked without this parameter.
The files you have mentioned above indicate that you have trained a PyTorch model (pytorch_model.bin), but in your own answer you try to load a TensorFlow model with:
bert_model = transformers.TFBertModel.from_pretrained(bert_model, from_pt=True)
As you have already figured out, you can create a TensorFlow model from a PyTorch state_dict by setting from_pt=True. But if it does not matter to you whether you use PyTorch or TensorFlow, you could initialize a PyTorch model right away with:
bert_model = transformers.BertModel.from_pretrained(bert_model)
I have not tried it, but I assume that this will be faster. The documentation also mentions that it is faster to convert the PyTorch state_dict with one of the provided conversion scripts and load the model afterward than to use from_pt=True.
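To make the choice between the two loading paths explicit, here is a minimal sketch that checks which weight file a directory contains and passes from_pt=True only when needed. The helper names detect_weight_format and load_tf_bert are hypothetical. Saving the converted model back into the same directory is optional and is my own assumption about what is convenient, not something the thread prescribes.

```python
import os


def detect_weight_format(model_dir):
    """Return which transformers weight file a directory contains."""
    if os.path.isfile(os.path.join(model_dir, "tf_model.h5")):
        return "tensorflow"
    if os.path.isfile(os.path.join(model_dir, "pytorch_model.bin")):
        return "pytorch"
    return "none"


def load_tf_bert(model_dir):
    """Load TFBertModel, converting from PyTorch weights only when needed."""
    # Imported lazily; requires transformers plus TensorFlow (and PyTorch
    # when the weights have to be converted).
    from transformers import TFBertModel

    fmt = detect_weight_format(model_dir)
    if fmt == "none":
        raise FileNotFoundError(f"no weight file found in {model_dir!r}")
    model = TFBertModel.from_pretrained(model_dir, from_pt=(fmt == "pytorch"))
    if fmt == "pytorch":
        # Optional: writes tf_model.h5 (and config.json) into model_dir so
        # that later loads can skip the conversion.
        model.save_pretrained(model_dir)
    return model
```

With the folder listing from the question (pytorch_model.bin but no tf_model.h5), detect_weight_format would return "pytorch", which matches the from_pt=True fix above.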

Huggingface saving tokenizer

I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't need access to the internet.
from transformers import AutoTokenizer
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_vocabulary("./models/tokenizer/")
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
However, the last line is giving the error:
OSError: Can't load config for './models/tokenizer3/'. Make sure that:
- './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'
- or './models/tokenizer3/' is the correct path to a directory containing a config.json file
transformers version: 3.1.0
The question "How to load the saved tokenizer from pretrained model in Pytorch" didn't help, unfortunately.
Edit 1
Thanks to @ashwin's answer below I tried save_pretrained instead, and I get the following error:
OSError: Can't load config for './models/tokenizer/'. Make sure that:
- './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
- or './models/tokenizer/' is the correct path to a directory containing a config.json file
After checking the contents of the tokenizer folder, I tried renaming tokenizer_config.json to config.json, and then I got the error:
ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
save_vocabulary() saves only the vocabulary file of the tokenizer (the list of tokens).
To save the entire tokenizer, you should use save_pretrained()
Thus, as follows:
from transformers import AutoTokenizer, DistilBertTokenizer
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")
Edit:
For some unknown reason, loading with the model-specific class works where AutoTokenizer does not. Instead of
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
use
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")
Renaming the "tokenizer_config.json" file (the one created by the save_pretrained() function) to "config.json" solved the same issue in my environment.
You need to save both your model and tokenizer in the same directory. HuggingFace is actually looking for the config.json file of your model, so renaming tokenizer_config.json would not solve the issue.
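The last comment can be sketched as follows. Going by the error messages above, AutoTokenizer in this transformers version wants a config.json with a model_type key, so one option is to save the model's config next to the tokenizer files. The helper names has_model_type and save_tokenizer_with_config are hypothetical, and I have not verified this against transformers 3.1.0 specifically.

```python
import json
import os


def has_model_type(directory):
    """True if directory/config.json exists and contains a model_type key."""
    path = os.path.join(directory, "config.json")
    if not os.path.isfile(path):
        return False
    with open(path, encoding="utf-8") as fh:
        return "model_type" in json.load(fh)


def save_tokenizer_with_config(base_model, directory):
    """Save tokenizer files and the model config side by side."""
    # Imported lazily; needs transformers and network access on first use.
    from transformers import AutoConfig, AutoTokenizer

    os.makedirs(directory, exist_ok=True)
    AutoTokenizer.from_pretrained(base_model).save_pretrained(directory)
    AutoConfig.from_pretrained(base_model).save_pretrained(directory)
    # Report whether the saved config is usable for AutoTokenizer lookups.
    return has_model_type(directory)
```

Running has_model_type("./models/tokenizer/") before loading is a quick way to tell whether the AutoTokenizer error is likely to reappear inside the offline container.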

Load a pre-trained model from disk with Huggingface Transformers

From the documentation for from_pretrained, I understand I don't have to download the pretrained vectors every time; I can save them and load them from disk with this syntax:
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
So, I went to the model hub:
https://huggingface.co/models
I found the model I wanted:
https://huggingface.co/bert-base-cased
I downloaded it from the link they provided to this repository:
Pretrained model on English language using a masked language modeling
(MLM) objective. It was introduced in this paper and first released in
this repository. This model is case-sensitive: it makes a difference
between english and English.
Stored it in:
/my/local/models/cased_L-12_H-768_A-12/
Which contains:
./
../
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
So, now I have the following:
PATH = '/my/local/models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
And I get this error:
> raise EnvironmentError(msg)
E OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E
E - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E
E - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file
Similarly for when I link to the config.json directly:
PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
        if state_dict is None and not from_tf:
            try:
                state_dict = torch.load(resolved_archive_file, map_location="cpu")
            except Exception:
                raise OSError(
>                   "Unable to load weights from pytorch checkpoint file. "
                    "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
                )
E               OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
What should I do differently to get huggingface to use my local pretrained model?
Update to address the comments
YOURPATH = '/somewhere/on/disk/'
name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)
>>> Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')
So all is saved, but then....
YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)
"Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
Where is the file located relative to your model folder? I believe it has to be a relative PATH rather than an absolute one. So if the file containing your code is located in 'my/local/', then your code should look like this:
PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
You just need to specify the folder that contains the files, not the files themselves. I think this is definitely a problem with the PATH. Try changing the style of slashes ("/" vs. "\"), since they differ between operating systems. Also try a leading ".", as in ./models/cased_L-12_H-768_A-12/.
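To rule out path problems before calling from_pretrained, a small check like the one below may help. It is only a sketch: describe_model_dir is a hypothetical helper, and it uses pathlib, which accepts forward slashes on every operating system, so the slash style stops being a variable.

```python
from pathlib import Path


def describe_model_dir(path):
    """Resolve a model path and report which files it contains."""
    resolved = Path(path).expanduser().resolve()
    is_dir = resolved.is_dir()
    return {
        "path": str(resolved),  # absolute form of whatever was passed in
        "is_dir": is_dir,
        "files": sorted(p.name for p in resolved.iterdir()) if is_dir else [],
    }
```

If "is_dir" is False the path itself is wrong; if it is True but "files" lacks config.json or a weight file, the problem is the folder contents rather than the path.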
I had this same need and just got this working with Tensorflow on my Linux box so figured I'd share.
My requirements.txt file for my code environment:
tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3
I'm using Python 3.6.
I went to this site here which shows the directory tree for the specific huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for your cased version. Also note that my link is to a very specific commit of this model, just for the sake of reproducibility - there will very likely be a more up-to-date version by the time someone reads this.
I manually downloaded (or had to copy/paste into notepad++ because the download button took me to a raw version of the txt / json in some cases... odd...) the following files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
NOTE: Once again, all I'm using is Tensorflow, so I didn't download the Pytorch weights. If you're using Pytorch, you'll likely want to download those weights instead of the tf_model.h5 file.
I then put those files in this directory on my Linux box:
/opt/word_embeddings/bert-base-uncased/
It is probably a good idea to make sure there are at least read permissions on all of these files as well, with a quick ls -la (my permissions on each file are -rw-r--r--). I also have execute permissions on the parent directory (the one listed above) so people can cd to this dir.
From there, I'm able to load the model like so:
tokenizer:
# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
layer/model weights:
# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
This should be quite easy on Windows 10 using a relative path. Assuming your pre-trained (PyTorch-based) transformer model is in the 'model' folder in your current working directory, the following code can load your model.
from transformers import AutoModel
model = AutoModel.from_pretrained(r'.\model', local_files_only=True)
Please note the 'dot' in '.\model'. Omitting it will make the load fail.
In addition to the config file and vocab file, you need to add the TF/PyTorch model file (with a .h5 or .bin extension) to your directory.
In your case, the PyTorch and TF models may be located at these URLs:
torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin
tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5
You can also find all required files in the "Files and versions" section of your model: https://huggingface.co/bert-base-cased/tree/main
The bert model folder contains these files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
If, instead of these, we need to use:
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
then how do we load the model?
Here is a short answer.
from transformers import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt',local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/pytorch_model.bin',config='../config.json', local_files_only=True)
Usually config.json need not be supplied explicitly if it resides in the same dir.
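For the original TensorFlow checkpoint layout asked about above (bert_config.json plus bert_model.ckpt.*), the from_pretrained documentation describes passing the path to the .index file together with from_tf=True and an explicit config. The following is an untested sketch of that route. The helper names are hypothetical, do_lower_case=False assumes a cased vocabulary, and loading this way needs both PyTorch and TensorFlow installed.

```python
import os

REQUIRED = ("bert_config.json", "bert_model.ckpt.index", "vocab.txt")


def tf_checkpoint_paths(ckpt_dir):
    """Return (config_path, index_path, vocab_path) or raise if one is missing."""
    paths = tuple(os.path.join(ckpt_dir, name) for name in REQUIRED)
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")
    return paths


def load_from_tf_checkpoint(ckpt_dir):
    """Load tokenizer and model from an original TensorFlow BERT checkpoint."""
    # Imported lazily; this path needs transformers, PyTorch and TensorFlow.
    from transformers import BertConfig, BertForPreTraining, BertTokenizer

    config_path, index_path, vocab_path = tf_checkpoint_paths(ckpt_dir)
    config = BertConfig.from_json_file(config_path)
    # Assumption: the checkpoint is a cased model, so casing is preserved.
    tokenizer = BertTokenizer(vocab_file=vocab_path, do_lower_case=False)
    model = BertForPreTraining.from_pretrained(
        index_path, from_tf=True, config=config
    )
    return tokenizer, model
```

Calling model.save_pretrained(some_dir) and tokenizer.save_pretrained(some_dir) afterwards should produce the config.json / pytorch_model.bin layout that the other answers load directly.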
You can use the simpletransformers library. Check out the link for a more detailed explanation.
from simpletransformers.classification import ClassificationModel
model = ClassificationModel(
    "bert", "dir/your_path"
)
Here I used ClassificationModel as an example. You can use the library for many other tasks as well, such as question answering.

DISTILBERT_BASE_UNCASED failed to load - Hugging face transformers

I am trying to execute:
import numpy as np
import ktrain
from ktrain import text
MODEL_NAME='distilbert-base-uncased'
t=text.Transformer(MODEL_NAME, maxlen=500, classes=np.unique(y_train))
I am getting the following error:
OSError: Model name 'distilbert-base-uncased' was not found in tokenizers model name list (distilbert-base-uncased, distilbert-base-uncased-distilled-squad, distilbert-base-cased, distilbert-base-cased-distilled-squad, distilbert-base-german-cased, distilbert-base-multilingual-cased). We assumed 'distilbert-base-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
ktrain version 0.14.6
transformers version 2.8.0
Libraries were installed using pip install.
Any help would be appreciated.

GUI module not found

I'm trying to run the following code:
# To use interactive plots (mouse clicks, zooming, panning) we use the nbagg back end. We want our graphs
# to be embedded in the notebook, inline mode, this combination is defined by the magic "%matplotlib notebook".
%matplotlib notebook
import SimpleITK as sitk
%run update_path_to_download_script
from downloaddata import fetch_data as fdata
import gui
# Using an external viewer (ITK-SNAP or 3D Slicer) we identified a visually appealing window-level setting
T1_WINDOW_LEVEL = (1050,500)
When I run it in Spyder 3.2.6 I get:
ModuleNotFoundError: No module named 'gui'
Any help would be appreciated.
Code source: http://insightsoftwareconsortium.github.io/SimpleITK-Notebooks/Python_html/30_Segmentation_Region_Growing.html
This isn't a Spyder issue.
The gui module is part of the notebooks repository. Either clone the repository or just download this file. Same goes for the downloaddata module.
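Once the repository is cloned or the files are downloaded, Python still has to find them. A minimal sketch, assuming the files sit in a directory of your choosing (the helper name add_module_dir and any path passed to it are hypothetical):

```python
import sys


def add_module_dir(directory):
    """Put a directory at the front of sys.path so its modules can be imported."""
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory
```

After calling add_module_dir with the folder that holds gui.py and downloaddata.py, the import gui line from the question should resolve. Placing the two files next to the script you run has the same effect without touching sys.path.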

Resources