Load a pre-trained model from disk with Huggingface Transformers - huggingface-transformers

From the documentation for from_pretrained, I understand that I don't have to download the pretrained vectors every time; I can save them and load them from disk with this syntax:
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
So, I went to the model hub:
https://huggingface.co/models
I found the model I wanted:
https://huggingface.co/bert-base-cased
I downloaded it from the link they provided to this repository:
Pretrained model on English language using a masked language modeling
(MLM) objective. It was introduced in this paper and first released in
this repository. This model is case-sensitive: it makes a difference
between english and English.
Stored it in:
/my/local/models/cased_L-12_H-768_A-12/
Which contains:
./
../
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
So, now I have the following:
PATH = '/my/local/models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
And I get this error:
> raise EnvironmentError(msg)
E OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E
E - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E
E - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file
Similarly for when I link to the config.json directly:
PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
if state_dict is None and not from_tf:
try:
state_dict = torch.load(resolved_archive_file, map_location="cpu")
except Exception:
raise OSError(
> "Unable to load weights from pytorch checkpoint file. "
"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
)
E OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
What should I do differently to get huggingface to use my local pretrained model?
Update to address the comments
YOURPATH = '/somewhere/on/disk/'
name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)
>>> Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')
So all is saved, but then....
YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)
"Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.

Where is your script located relative to your model folder? I believe it has to be a relative PATH rather than an absolute one. So if the file containing your code is located in 'my/local/', then your code should look like this:
PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
You just need to specify the folder where all the files are, not the files themselves. I think this is definitely a problem with the PATH. Try changing the style of slashes, "/" vs "\", since these differ between operating systems. Also try a leading ".", as in ./models/cased_L-12_H-768_A-12/.

I had this same need and just got this working with Tensorflow on my Linux box so figured I'd share.
My requirements.txt file for my code environment:
tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3
I'm using Python 3.6.
I went to this site here which shows the directory tree for the specific huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for your cased version. Also note that my link is to a very specific commit of this model, just for the sake of reproducibility - there will very likely be a more up-to-date version by the time someone reads this.
I manually downloaded (or had to copy/paste into notepad++ because the download button took me to a raw version of the txt / json in some cases... odd...) the following files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
NOTE: Once again, all I'm using is Tensorflow, so I didn't download the Pytorch weights. If you're using Pytorch, you'll likely want to download those weights instead of the tf_model.h5 file.
I then put those files in this directory on my Linux box:
/opt/word_embeddings/bert-base-uncased/
It's probably a good idea to make sure there are at least read permissions on all of these files with a quick ls -la (my permissions on each file are -rw-r--r--). I also have execute permissions on the parent directory (the one listed above) so people can cd to this dir.
From there, I'm able to load the model like so:
tokenizer:
# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
layer/model weights:
# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
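Before calling from_pretrained on a local directory, it can help to confirm that the files listed above are actually present. The following is a minimal sketch using only the standard library. The helper name missing_model_files is hypothetical, the expected file names are taken from the list in this answer, and pytorch_model.bin is assumed to be the PyTorch counterpart of tf_model.h5.

```python
from pathlib import Path

# File names taken from the list above; the weights file depends on the framework.
REQUIRED_FILES = ("config.json", "vocab.txt")
WEIGHT_FILES = ("tf_model.h5", "pytorch_model.bin")  # either one is enough


def missing_model_files(model_dir):
    """Return the names of expected files that are absent from model_dir."""
    directory = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (directory / name).is_file()]
    # At least one weights file must be present.
    if not any((directory / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing
```

For a directory like the one in the question, which holds bert_config.json and the bert_model.ckpt.* files, this would report config.json and the weights file as missing, which is consistent with the error messages shown there.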

This should be quite easy on Windows 10 using a relative path. Assuming your pre-trained (PyTorch-based) transformer model is in a 'model' folder in your current working directory, the following code can load your model.
from transformers import AutoModel
model = AutoModel.from_pretrained(r'.\model', local_files_only=True)  # raw string keeps the backslash literal
Please note the 'dot' in '.\model'. Missing it will make the code unsuccessful.

In addition to the config file and vocab file, you need to add the TF/PyTorch model file (which has an .h5 or .bin extension) to your directory.
In your case, the PyTorch and TF models may be located at these URLs:
torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin
tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5
you can also find all required files in files and versions section of your model: https://huggingface.co/bert-base-cased/tree/main

The BERT model folder contains these files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
If instead we have these files:
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
then how do we load the model?
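One part of this can be sketched with the standard library. The error in the original question asks for a config.json, while the TensorFlow checkpoint layout names the file bert_config.json. Assuming the two files carry the same JSON settings (not verified here), copying one to the other lets the config lookup find it. The helper name ensure_config_json is hypothetical. The bert_model.ckpt.* weights are a separate matter: as the second error message in the question suggests, they would still need from_tf=True or a conversion step, which this sketch does not attempt.

```python
import shutil
from pathlib import Path


def ensure_config_json(model_dir):
    """Copy bert_config.json to config.json if config.json is not there yet."""
    directory = Path(model_dir)
    source = directory / "bert_config.json"
    target = directory / "config.json"
    if not target.exists() and source.is_file():
        shutil.copyfile(source, target)  # the original file is left untouched
    return target
```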

Here is a short answer.
tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt',local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/pytorch_model.bin',config='../config.json', local_files_only=True)
Usually config.json need not be supplied explicitly if it resides in the same dir.

You can use the simpletransformers library. Check out the link for a more detailed explanation.
model = ClassificationModel(
"bert", "dir/your_path"
)
Here I used Classification Model as an example. You can use it for many other tasks as well like question answering etc.

Related

Kedro context and catalog missing from Jupyter Notebook

I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my project name when selecting "New", I get the following errors:
context
NameError: name 'context' is not defined
catalog.list()
NameError: name 'catalog' is not defined
EDIT:
After running the magic command %kedro_reload I can see that my ProjectContext init_spark_session is looking for files in project_name/notebooks instead of project_name/src. I tried changing the working directory in my Jupyter Notebook session with %cd ../src and os.chdir('../src'), but kedro still looks in the notebooks folder:
%kedro_reload
java.io.FileNotFoundException: File file:/Users/user_name/Documents/app_name/kedro/notebooks/dist/project_name-0.1-py3.8.egg does not exist
_spark_session.sparkContext.addPyFile() is looking in the wrong place. When I comment out this line from my ProjectContext this error goes away but I receive another one about not being able to find my Oracle driver when trying to load a dataset from the catalog:
df = catalog.load('dataset')
java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
EDIT 2:
For reference:
kedro/src/project_name/context.py
def init_spark_session(self) -> None:
"""Initialises a SparkSession using the config defined in project's conf folder."""
# Load the spark configuration in spark.yaml using the config loader
parameters = self.config_loader.get("spark*", "spark*/**")
spark_conf = SparkConf().setAll(parameters.items())
# Initialise the spark session
spark_session_conf = (
SparkSession.builder.appName(self.package_name)
.enableHiveSupport()
.config(conf=spark_conf)
)
_spark_session = spark_session_conf.getOrCreate()
_spark_session.sparkContext.setLogLevel("WARN")
_spark_session.sparkContext.addPyFile(f'src/dist/project_name-{__version__}-py3.8.egg')
kedro/conf/base/spark.yml:
# You can define spark specific configuration here.
spark.driver.maxResultSize: 8g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
# JDBC driver
spark.jars: drivers/ojdbc8-21.1.0.0.jar
I think a combination of these might help you:
Generally, let's try to avoid manually interfering with the current working directory, so let's remove os.chdir in your notebook. Construct an absolute path where possible.
In your init_spark_session, use an absolute path when calling addPyFile. self.project_path points to the root directory of your Kedro project, so you can use it to construct the path to your PyFile accordingly, e.g. _spark_session.sparkContext.addPyFile(f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg')
Not sure why you would need to add the PyFile though, but maybe you have a specific reason.
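The path construction suggested above can also be written with pathlib. This is a minimal sketch, not Kedro API: the helper name egg_path and its default arguments are hypothetical, and only the src/dist layout and the egg file name pattern come from the question.

```python
from pathlib import Path


def egg_path(project_path, version, package="project_name", py_tag="py3.8"):
    """Build an absolute path to the packaged egg under <project>/src/dist."""
    root = Path(project_path).resolve()  # absolute, independent of the notebook's cwd
    return root / "src" / "dist" / f"{package}-{version}-{py_tag}.egg"
```

Inside init_spark_session this could be used as addPyFile(str(egg_path(self.project_path, __version__))), so the result no longer depends on whether the session was started from the notebooks folder.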

Huggingface saving tokenizer

I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't need access to the internet.
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_vocabulary("./models/tokenizer/")
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
However, the last line is giving the error:
OSError: Can't load config for './models/tokenizer3/'. Make sure that:
- './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'
- or './models/tokenizer3/' is the correct path to a directory containing a config.json file
transformers version: 3.1.0
How to load the saved tokenizer from pretrained model in Pytorch didn't help unfortunately.
Edit 1
Thanks to @ashwin's answer below I tried save_pretrained instead, and I get the following error:
OSError: Can't load config for './models/tokenizer/'. Make sure that:
- './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
- or './models/tokenizer/' is the correct path to a directory containing a config.json file
the contents of the tokenizer folder is below:
I tried renaming tokenizer_config.json to config.json and then I got the error:
ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
save_vocabulary() saves only the vocabulary file of the tokenizer (the list of BPE tokens).
To save the entire tokenizer, you should use save_pretrained().
For example:
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")
Edit:
for some unknown reason:
instead of
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
using
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")
works.
Renaming "tokenizer_config.json" file -- the one created by save_pretrained() function -- to "config.json" solved the same issue on my environment.
You need to save both your model and tokenizer in the same directory. HuggingFace is actually looking for the config.json file of your model, so renaming tokenizer_config.json would not solve the issue.
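The errors quoted in the question suggest that, at least in transformers 3.1.0, AutoTokenizer expects the directory to contain a config.json with a model_type key. A minimal standard-library sketch of that check follows. The helper name has_model_config is hypothetical, and it only inspects files; it does not call transformers.

```python
import json
from pathlib import Path


def has_model_config(directory):
    """Return True if directory holds a config.json that declares a model_type."""
    config_file = Path(directory) / "config.json"
    if not config_file.is_file():
        return False
    with config_file.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return "model_type" in config
```

If this returns False for the tokenizer folder, the workarounds reported in this thread are to load through the concrete class (DistilBertTokenizer here) or to save the model into the same directory as the tokenizer.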

How do I change the default download directory for pre-trained model in Keras?

According to the documentation, the default download directory for all Keras files is $HOME/.keras. I'm using a virtual environment and I want to change the default download directory of pre-trained models to a different directory. Maybe this has more to do with virtualenv than with Keras?
If you are using the master branch of keras, you can set the KERAS_HOME environment variable to set the cache directory. If it is not set, cache directory defaults to $HOME/.keras.
export KERAS_HOME="/path/to/keras/dir"
Add the line to your ".bashrc" to set the variable every time you open up a new terminal.
This has not yet been released; you must use the master branch to use this feature.
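The lookup rule described in this answer can be sketched in Python as follows. This mirrors the described behaviour only; it is not Keras's own code, and the helper name keras_cache_dir is hypothetical.

```python
import os


def keras_cache_dir(environ=None):
    """Return KERAS_HOME if set, otherwise the default $HOME/.keras."""
    env = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser("~"), ".keras")
    return env.get("KERAS_HOME", default)
```

Calling keras_cache_dir() in the environment where your code runs shows which directory would apply, which is a quick way to confirm that the export line has taken effect inside a virtualenv.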
According to documentation
Signature: ResNet50(include_top=True, weights='imagenet',
input_tensor=None, input_shape=None, pooling=None, classes=1000)
There's no parameter to specify where to download the pre-trained model weights.
(1) What you can do is to move the file to where you want it to be after the download from your terminal using mv (https://www.macworld.com/article/2080814/master-the-command-line-copying-and-moving-files.html).
UPDATE: I went to check the github repo of Keras (https://github.com/keras-team/keras/blob/master/keras/applications/resnet50.py) and found the link to the weights. For resnet:
WEIGHTS_PATH = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5'
WEIGHTS_PATH_NO_TOP = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'
You can download those weights directly to your file system using whatever method you like (e.g. urllib).
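A minimal sketch of the urllib approach, assuming you want the file saved under a directory of your choice. The helper name download_weights is hypothetical; urllib.request.urlretrieve is the standard-library call doing the work.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_weights(url, dest_dir):
    """Download url into dest_dir and return the local file path."""
    os.makedirs(dest_dir, exist_ok=True)
    filename = os.path.basename(urlparse(url).path)  # e.g. the *.h5 file name
    local_path = os.path.join(dest_dir, filename)
    urllib.request.urlretrieve(url, local_path)
    return local_path
```

Calling download_weights(WEIGHTS_PATH_NO_TOP, "/path/of/your/choice") with one of the URLs above would place the .h5 file in that directory.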
You can copy a model file (*.h5) that has been downloaded by other means into the Keras default models directory, ~/.keras/models.

Pocketsphinx install does not contain acoustic model definition mdef

I have tried to install pocketsphinx 5 prealpha on Windows, but it seems to be stuck on the error below.
INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(143): mean[0]= 12.00, mean[1..12]= 0.0
ERROR: "acmod.c", line 83: Folder 'model/en-us/en-us' does not contain acoustic model definition 'mdef'
My sphinxbase and pocketsphinx folders are in the same parent folder, and I have renamed them as the instructions say.
how I compile it
I have checked all the directories, and the folder does contain an mdef file without an extension.
What should I do?
Thank you.
You need to specify a proper path to the model folder. You are currently in bin\Release\x64 folder. In your case the path to the model folder must be ..\..\..\model\en-us\en-us. If you are not sure what is the relative path, specify an absolute path.
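To see where a relative path like this ends up, the standard-library ntpath module applies Windows path rules on any platform. The install location C:\pocketsphinx below is hypothetical; only the bin\Release\x64 suffix and the relative path come from this answer.

```python
import ntpath  # Windows path rules, usable on any platform


def resolve_windows_path(current_dir, relative_path):
    """Join a relative Windows path onto current_dir and collapse any '..' parts."""
    return ntpath.normpath(ntpath.join(current_dir, relative_path))


# Hypothetical install location; only the bin\Release\x64 suffix comes from the answer.
cwd = r"C:\pocketsphinx\bin\Release\x64"
model_dir = resolve_windows_path(cwd, r"..\..\..\model\en-us\en-us")
print(model_dir)  # prints C:\pocketsphinx\model\en-us\en-us
```

The resolved absolute form is what you would pass if the relative path is in doubt.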
This is because the example code sets the MODELDIR and DATADIR variables to default locations, and you need to set them to match your own file locations. Changing the following might sort out the issue:
MODELDIR = "/usr/local/share/pocketsphinx/model/"
DATADIR = "/my/Desktop/directory/pocketsphinx-master/test/data/"
This should work, although I'm not sure. Do you have a better solution?

Redprob.ado file installation

I’m trying to use the dynamic RE panel logit model.
I downloaded the redprob.ado, redprob.hlp, redpmod_ll.ado files from Prof. Mark Stewart's website and installed as follows.
Created the ado and personal folders on my C: drive.
Saved those 3 files in the personal folder.
Typed net set ado c:\ado\personal
Typed adopath + "c:\ado\personal"
Typed ssc install redprob
But the following message came out.
ssc install: "redprob" not found at SSC, type -findit redprob-
(To find all packages at SSC that start with r, type -ssc describe r-)
What is the way to install and use redprob?
As you noted these files should be downloaded from Mark Stewart's site, namely http://www2.warwick.ac.uk/fac/soc/economics/staff/academic/stewart/stata
The third file is called redpmod_ll.ado, and not as you originally typed.
They are not on SSC, so the ssc command is completely irrelevant. What you did is like saying "fetch me this program from Warwick" and then "and now fetch me this program from Connecticut" when it is only at Warwick.
The StataCorp recommendation is to use c:\ado\plus, but what you did should have worked.
Try it with
which redprob
and Stata should be able to find the program and tell you where it is.
If you have manually downloaded the files, simply pasting them into c:\ado\personal should do the job. Stata will automatically load commands from this folder when it starts. Create the folder if it doesn't exist.

Resources