How to move a Spark model from one Spark installation to another? - apache-spark-mllib

I have an ALS model that I train on one Spark installation. I persist it like so:
model.save(sc, './recommender_model')
On the filesystem it looks like this:
$ find ./recommender_model
./recommender_model
./recommender_model/metadata
./recommender_model/metadata/_SUCCESS
./recommender_model/metadata/._SUCCESS.crc
./recommender_model/metadata/.part-00000.crc
./recommender_model/metadata/part-00000
./recommender_model/data
./recommender_model/data/product
./recommender_model/data/product/part-r-00001-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet
./recommender_model/data/product/_common_metadata
./recommender_model/data/product/.part-r-00000-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet.crc
./recommender_model/data/product/.part-r-00001-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet.crc
./recommender_model/data/product/_SUCCESS
./recommender_model/data/product/._metadata.crc
./recommender_model/data/product/._SUCCESS.crc
./recommender_model/data/product/._common_metadata.crc
./recommender_model/data/product/part-r-00000-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet
./recommender_model/data/product/_metadata
./recommender_model/data/user
./recommender_model/data/user/_common_metadata
./recommender_model/data/user/.part-r-00001-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet.crc
./recommender_model/data/user/_SUCCESS
./recommender_model/data/user/.part-r-00000-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet.crc
./recommender_model/data/user/._metadata.crc
./recommender_model/data/user/part-r-00000-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet
./recommender_model/data/user/._SUCCESS.crc
./recommender_model/data/user/._common_metadata.crc
./recommender_model/data/user/part-r-00001-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet
./recommender_model/data/user/_metadata
I would like to move this folder to another Spark installation, so that I train on one installation and use the other for predictions.
Can I simply tar up this folder and unpack it on the other Spark instance, where I can load the model? E.g.
from pyspark.mllib.recommendation import MatrixFactorizationModel
model = MatrixFactorizationModel.load(sc, './recommender_model')
my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = model.predictAll(my_movie)
individual_movie_rating_RDD.collect()

I performed some basic testing and the approach of zipping up the whole folder and unzipping it on another machine worked well for me.
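For reference, here is a minimal sketch of that workflow, assuming both installations run compatible Spark versions and the model lives on the local filesystem (the archive name is my own):

import tarfile

from pyspark.mllib.recommendation import MatrixFactorizationModel

# On the training installation: pack the saved model directory.
with tarfile.open('recommender_model.tar.gz', 'w:gz') as tar:
    tar.add('./recommender_model')

# ... copy recommender_model.tar.gz to the other machine (scp, rsync, etc.) ...

# On the predicting installation: unpack, then load as usual.
with tarfile.open('recommender_model.tar.gz', 'r:gz') as tar:
    tar.extractall('.')

model = MatrixFactorizationModel.load(sc, './recommender_model')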

Load a pre-trained model from disk with Huggingface Transformers

From the documentation for from_pretrained, I understand I don't have to download the pretrained vectors every time; I can save them and load them from disk with this syntax:
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
So, I went to the model hub:
https://huggingface.co/models
I found the model I wanted:
https://huggingface.co/bert-base-cased
I downloaded it from the link they provided to this repository:
Pretrained model on English language using a masked language modeling
(MLM) objective. It was introduced in this paper and first released in
this repository. This model is case-sensitive: it makes a difference
between english and English.
Stored it in:
/my/local/models/cased_L-12_H-768_A-12/
Which contains:
./
../
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
So, now I have the following:
PATH = '/my/local/models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
And I get this error:
> raise EnvironmentError(msg)
E OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E
E - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E
E - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file
Similarly, when I point to the config.json directly:
PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
if state_dict is None and not from_tf:
    try:
        state_dict = torch.load(resolved_archive_file, map_location="cpu")
    except Exception:
        raise OSError(
>           "Unable to load weights from pytorch checkpoint file. "
            "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
        )
E   OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
What should I do differently to get huggingface to use my local pretrained model?
Update to address the comments
YOURPATH = '/somewhere/on/disk/'
name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)
>>> Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')
So all is saved, but then....
YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)
"Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
Where is the file located relative to your model folder? I believe it has to be a relative PATH rather than an absolute one. So if the file where you are writing the code is located in 'my/local/', then your code should look like this:
PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
You just need to specify the folder where all the files are, not the files directly. I think this is definitely a problem with the PATH. Try changing the style of "slashes": "/" vs "\" differ between operating systems. Also try using ".", like so: ./models/cased_L-12_H-768_A-12/ etc.
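Along the same lines, here is a minimal sketch for the transfo-xl update above, assuming the files were written with save_pretrained as shown. Pass the save directory itself to from_pretrained rather than cache_dir: cache_dir is the hub download cache keyed by model id, so from_pretrained will not find a plain save_pretrained folder there. The slow TransfoXLTokenizer is used because, per the warning above, the fast tokenizer cannot read the saved vocabulary back:

from transformers import TransfoXLModel, TransfoXLTokenizer

YOURPATH = '/somewhere/on/disk/'
# Load directly from the directory save_pretrained wrote to.
tokenizer = TransfoXLTokenizer.from_pretrained(YOURPATH, local_files_only=True)
model = TransfoXLModel.from_pretrained(YOURPATH, local_files_only=True)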
I had this same need and just got this working with Tensorflow on my Linux box, so I figured I'd share.
My requirements.txt file for my code environment:
tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3
I'm using Python 3.6.
I went to this site here which shows the directory tree for the specific huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for your cased version. Also note that my link is to a very specific commit of this model, just for the sake of reproducibility - there will very likely be a more up-to-date version by the time someone reads this.
I manually downloaded the following files (or had to copy/paste them into Notepad++, because in some cases the download button took me to a raw version of the txt/json... odd...):
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
NOTE: Once again, all I'm using is Tensorflow, so I didn't download the PyTorch weights. If you're using PyTorch, you'll likely want to download those weights instead of the tf_model.h5 file.
I then put those files in this directory on my Linux box:
/opt/word_embeddings/bert-base-uncased/
It's probably a good idea to make sure there's at least read permission on all of these files as well, with a quick ls -la (my permissions on each file are -rw-r--r--). I also have execute permission on the parent directory (the one listed above) so people can cd to this dir.
From there, I'm able to load the model like so:
tokenizer:
# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
layer/model weights:
# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
This should be quite easy on Windows 10 using a relative path. Assuming your pre-trained (PyTorch-based) transformer model is in a 'model' folder in your current working directory, the following code can load your model.
from transformers import AutoModel
model = AutoModel.from_pretrained('.\model', local_files_only=True)
Please note the 'dot' in '.\model'. Missing it will make the code fail.
In addition to the config file and vocab file, you need to add the tf/torch model (which has a .h5/.bin extension) to your directory.
In your case, the torch and tf models may be located at these URLs:
torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin
tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5
You can also find all required files in the "Files and versions" section of your model: https://huggingface.co/bert-base-cased/tree/main
The bert model folder contains these files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
But what if, instead of these, we have:
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
How do we load those?
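One way to handle that layout, sketched here on the assumption that transformers' documented from_tf=True path for original TensorFlow checkpoints applies to these files (TensorFlow must be installed for the conversion, and the paths are the ones from the question):

from transformers import BertConfig, BertForPreTraining, BertTokenizer

PATH = '/my/local/models/cased_L-12_H-768_A-12/'

# Build the config from the original bert_config.json, then load the TF 1.x
# checkpoint by pointing at the .index file with from_tf=True.
config = BertConfig.from_json_file(PATH + 'bert_config.json')
model = BertForPreTraining.from_pretrained(
    PATH + 'bert_model.ckpt.index', from_tf=True, config=config
)
tokenizer = BertTokenizer(vocab_file=PATH + 'vocab.txt')

# Optionally re-save in the Hugging Face layout (config.json + pytorch_model.bin)
# so a plain from_pretrained(dir) works afterwards; the target dir is my own choice.
model.save_pretrained('/my/local/models/bert-base-cased-hf/')
tokenizer.save_pretrained('/my/local/models/bert-base-cased-hf/')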
Here is a short answer.
tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/pytorch_model.bin', config='../config.json', local_files_only=True)
Usually, config.json need not be supplied explicitly if it resides in the same directory as the model file.
You can use the simpletransformers library; check out the link for a more detailed explanation.
model = ClassificationModel(
    "bert", "dir/your_path"
)
Here I used ClassificationModel as an example. You can use it for many other tasks as well, like question answering.
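A self-contained usage sketch of that approach, assuming a local checkpoint directory stands in for "dir/your_path" (use_cuda=False just keeps it CPU-only):

from simpletransformers.classification import ClassificationModel

model = ClassificationModel("bert", "dir/your_path", use_cuda=False)
# predict() takes a list of texts and returns (predicted labels, raw model outputs).
predictions, raw_outputs = model.predict(["an example sentence to classify"])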

Data versioning of "Hello_World" tutorial

I have added "versioned: true" in the "catalog.yml" file of the "hello_world" tutorial:
example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv
  versioned: true
Then, when I used "kedro run" to run the tutorial, it failed with the error below:
"VersionNotFoundError: Did not find any versions for CSVDataSet"
May I know the right way to do versioning for the "iris.csv" file? Thanks!
Try versioning one of the downstream outputs. For example, add this entry to your catalog.yml and run kedro run:
example_train_x:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/example_iris_data.csv
  versioned: true
You will then see an example_iris_data.csv directory (not a file) under data/02_intermediate. The reason example_iris_data gives you an error is that it's the starting data, and there's already an iris.csv file in data/01_raw, so Kedro cannot create the data/01_raw/iris.csv/ directory because of the name conflict with the existing file.
Hope this helps :)
The reason for the error is that when Kedro tries to load the dataset, it looks for a file at data/01_raw/iris.csv/<load_version>/iris.csv and, of course, cannot find such a path. So if you really want to enable versioning for your input data, you can move iris.csv like this:
mv data/01_raw/iris.csv data/01_raw/iris.csv_tmp
mkdir -p data/01_raw/iris.csv/<put_some_timestamp_here>
mv data/01_raw/iris.csv_tmp data/01_raw/iris.csv/<put_some_timestamp_here>/iris.csv
You wouldn't need to do that for any intermediate data, as these path manipulations are done by Kedro automatically when it saves a dataset (but not on load).
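If it helps to see what Kedro does under the hood, here is a hedged sketch using the dataset API directly; the module paths follow older Kedro releases and are assumptions:

import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import Version

data_set = CSVDataSet(
    filepath="data/02_intermediate/example_iris_data.csv",
    version=Version(load=None, save=None),  # None: latest on load, fresh timestamp on save
)
data_set.save(pd.DataFrame({"a": [1, 2]}))
# This writes data/02_intermediate/example_iris_data.csv/<save_version>/example_iris_data.csv,
# which is why a plain file already sitting at the versioned filepath causes a conflict.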

Can I run stanza NER without downloading the language modules?

I need to run Stanza NER on a platform without any access to an external network. The code stanza.download('en') fails, and running without the download function gives me an exception:
Exception: Resources file not found at: \home\stanza_resources\resources.json. Try to download the model again
Is there a way to download and cache all the required modules in a resource directory and point the Stanza pipeline to this directory?
Thanks
It looks like both download and the Pipeline class take a directory argument, dir, so the code below works:
stanza.download('en', dir='resources/', processors={ner_processor: package})
nlp_pipeline = stanza.Pipeline('en', dir='resources/', processors={ner_processor: package})
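Putting it together, a minimal offline sketch, assuming the English models were downloaded once on a machine with network access into resources/ and that directory was copied to the offline platform (the processor list and example text are my own):

import stanza

# One-time, on a machine with network access:
# stanza.download('en', dir='resources/', processors='tokenize,ner')

# On the offline platform, point the pipeline at the copied directory.
nlp_pipeline = stanza.Pipeline('en', dir='resources/', processors='tokenize,ner')
doc = nlp_pipeline("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    print(ent.text, ent.type)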

How do I configure Tensorflow Serving to serve models from HDFS?

I'm attempting to serve TensorFlow models out of HDFS using the TensorFlow Serving project.
I'm running the tensorflow/serving Docker container, tag 1.10.1:
https://hub.docker.com/r/tensorflow/serving
I can see the tensorflow/serving repo referencing Hadoop at
https://github.com/tensorflow/serving/blob/628702e1de1fa3d679369e9546e7d74fa91154d3/tensorflow_serving/model_servers/BUILD#L341
"@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system"
This is a reference to
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/hadoop/hadoop_file_system.cc
I have set the following environment variables:
HADOOP_HDFS_HOME to point to my HDFS home (/etc/hadoop in my case).
MODEL_BASE_PATH set to "hdfs://tensorflow/models"
MODEL_NAME set to name of model I wish to load
I mount the Hadoop home into the Docker container and can verify it using docker exec.
When I run the Docker container, I get the following in the logs:
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:369] FileSystemStoragePathSource encountered a file-system access error: Could not find base path hdfs://tensorflow/models/my_model for servable my_model
I have found examples of TensorFlow training from HDFS, but not of serving models from HDFS with TensorFlow Serving.
Can Tensorflow Serving serve models from HDFS?
If so, how do you do this?
In the BUILD file of model_servers, under the cc_test for get_model_status_impl_test, add the line "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system", as shown below:
cc_test(
    name = "get_model_status_impl_test",
    size = "medium",
    srcs = ["get_model_status_impl_test.cc"],
    data = [
        "//tensorflow_serving/servables/tensorflow/testdata:saved_model_half_plus_two_2_versions",
    ],
    deps = [
        ":get_model_status_impl",
        ":model_platform_types",
        ":platform_config_util",
        ":server_core",
        "//tensorflow_serving/apis:model_proto",
        "//tensorflow_serving/core:availability_preserving_policy",
        "//tensorflow_serving/core/test_util:test_main",
        "//tensorflow_serving/servables/tensorflow:saved_model_bundle_source_adapter_proto",
        "//tensorflow_serving/servables/tensorflow:session_bundle_config_proto",
        "//tensorflow_serving/servables/tensorflow:session_bundle_source_adapter_proto",
        "//tensorflow_serving/test_util",
        "@org_tensorflow//tensorflow/cc/saved_model:loader",
        "@org_tensorflow//tensorflow/cc/saved_model:signature_constants",
        "@org_tensorflow//tensorflow/contrib/session_bundle",
        "@org_tensorflow//tensorflow/core:test",
        "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system",
    ],
)
I think this would solve your problem.
Ref: Fail to load the models from HDFS

How do I change the default download directory for pre-trained model in Keras?

According to the documentation, the default download directory for all Keras files is $HOME/.keras. I'm using a virtual environment, and I want to change the default download directory for pre-trained models to a different directory. Maybe this has more to do with virtualenv than with Keras?
If you are using the master branch of Keras, you can set the KERAS_HOME environment variable to set the cache directory. If it is not set, the cache directory defaults to $HOME/.keras.
export KERAS_HOME="/path/to/keras/dir"
Add the line to your ".bashrc" to set the variable every time you open a new terminal.
This has not yet been released; you must use the master branch to get this feature.
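If you would rather set it from Python (say, at the top of a notebook tied to the virtualenv), here is a small sketch; the path is an assumption, and setting the variable before Keras is imported is the safe order, since parts of the configuration are read at import time:

import os

os.environ['KERAS_HOME'] = '/path/to/keras/dir'  # assumed path; pick your own

import keras  # imported after KERAS_HOME is set so the cache directory is picked up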
According to the documentation:
Signature: ResNet50(include_top=True, weights='imagenet',
input_tensor=None, input_shape=None, pooling=None, classes=1000)
There's no parameter to specify where to download the pre-trained model weights.
What you can do is move the file to where you want it after the download, from your terminal, using mv (https://www.macworld.com/article/2080814/master-the-command-line-copying-and-moving-files.html).
UPDATE: I went to check the GitHub repo of Keras (https://github.com/keras-team/keras/blob/master/keras/applications/resnet50.py) and found the links to the weights. For ResNet:
WEIGHTS_PATH = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5'
WEIGHTS_PATH_NO_TOP = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'
You can download those weights directly to your filesystem using whatever method you like (e.g. urllib), as sketched below.
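A hedged sketch of that manual-download approach; the local paths are my own, and it relies on ResNet50's weights argument accepting a path to a weights file:

import os
import urllib.request

from keras.applications.resnet50 import ResNet50

WEIGHTS_PATH = ('https://github.com/fchollet/deep-learning-models/releases/'
                'download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5')
weights_file = '/my/models/resnet50_weights.h5'  # assumed target location

# Fetch the weights once, then load them from disk instead of the 'imagenet' shortcut.
os.makedirs(os.path.dirname(weights_file), exist_ok=True)
urllib.request.urlretrieve(WEIGHTS_PATH, weights_file)
model = ResNet50(weights=weights_file)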
You can also copy a model file (*.h5) that was downloaded by other means into the default Keras models directory, ~/.keras/models.
