I'm attempting to serve TensorFlow models out of HDFS using the TensorFlow Serving project.
I'm running the tensorflow/serving Docker container, tag 1.10.1:
https://hub.docker.com/r/tensorflow/serving
I can see the tensorflow/serving repo referencing Hadoop at
https://github.com/tensorflow/serving/blob/628702e1de1fa3d679369e9546e7d74fa91154d3/tensorflow_serving/model_servers/BUILD#L341
"@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system"
This is a reference to
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/hadoop/hadoop_file_system.cc
I have set the following environment variables:
HADOOP_HDFS_HOME to point to my HDFS home (/etc/hadoop in my case).
MODEL_BASE_PATH set to "hdfs://tensorflow/models"
MODEL_NAME set to name of model I wish to load
I mount the Hadoop home into the Docker container and can verify it using docker exec.
When I run the Docker container I get the following in the logs:
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:369] FileSystemStoragePathSource encountered a file-system access error: Could not find base path hdfs://tensorflow/models/my_model for servable my_model
I have found examples of TensorFlow training from HDFS, but not of serving models from HDFS with TensorFlow Serving.
Can Tensorflow Serving serve models from HDFS?
If so, how do you do this?
In the BUILD file of model_servers, under the cc_test for get_model_status_impl_test, add the line "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system", as shown below:
cc_test(
    name = "get_model_status_impl_test",
    size = "medium",
    srcs = ["get_model_status_impl_test.cc"],
    data = [
        "//tensorflow_serving/servables/tensorflow/testdata:saved_model_half_plus_two_2_versions",
    ],
    deps = [
        ":get_model_status_impl",
        ":model_platform_types",
        ":platform_config_util",
        ":server_core",
        "//tensorflow_serving/apis:model_proto",
        "//tensorflow_serving/core:availability_preserving_policy",
        "//tensorflow_serving/core/test_util:test_main",
        "//tensorflow_serving/servables/tensorflow:saved_model_bundle_source_adapter_proto",
        "//tensorflow_serving/servables/tensorflow:session_bundle_config_proto",
        "//tensorflow_serving/servables/tensorflow:session_bundle_source_adapter_proto",
        "//tensorflow_serving/test_util",
        "@org_tensorflow//tensorflow/cc/saved_model:loader",
        "@org_tensorflow//tensorflow/cc/saved_model:signature_constants",
        "@org_tensorflow//tensorflow/contrib/session_bundle",
        "@org_tensorflow//tensorflow/core:test",
        "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system",
    ],
)
I think this would solve your problem.
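Independently of the rebuild, it can also help to confirm that the hdfs:// path itself is reachable from the same environment. A minimal sketch, assuming a Python TensorFlow build compiled with HDFS support and the same HADOOP_HDFS_HOME/CLASSPATH settings as the serving container:
# Sanity-check that the model base path resolves over HDFS.
import tensorflow as tf

MODEL_BASE_PATH = "hdfs://tensorflow/models"  # value from the question

print(tf.io.gfile.exists(MODEL_BASE_PATH))   # True if the path is reachable
print(tf.io.gfile.listdir(MODEL_BASE_PATH))  # should list the model directories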
Ref: Fail to load the models from HDFS
Related
I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my project name when selecting "New", I get the following errors:
context
NameError: name 'context' is not defined
catalog.list()
NameError: name 'catalog' is not defined
EDIT:
After running the magic command %kedro_reload I can see that my ProjectContext's init_spark_session is looking for files in project_name/notebooks instead of project_name/src. I tried changing the working directory in my Jupyter Notebook session with %cd ../src and os.chdir('../src'), but kedro still looks in the notebooks folder:
%kedro_reload
java.io.FileNotFoundException: File file:/Users/user_name/Documents/app_name/kedro/notebooks/dist/project_name-0.1-py3.8.egg does not exist
_spark_session.sparkContext.addPyFile() is looking in the wrong place. When I comment out this line from my ProjectContext this error goes away but I receive another one about not being able to find my Oracle driver when trying to load a dataset from the catalog:
df = catalog.load('dataset')
java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
EDIT 2:
For reference:
kedro/src/project_name/context.py
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in project's conf folder."""
    # Load the spark configuration in spark.yaml using the config loader
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())

    # Initialise the spark session
    spark_session_conf = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
    )
    _spark_session = spark_session_conf.getOrCreate()
    _spark_session.sparkContext.setLogLevel("WARN")
    _spark_session.sparkContext.addPyFile(f'src/dist/project_name-{__version__}-py3.8.egg')
kedro/conf/base/spark.yml:
# You can define spark specific configuration here.
spark.driver.maxResultSize: 8g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
# JDBC driver
spark.jars: drivers/ojdbc8-21.1.0.0.jar
I think a combination of the following might help you:
Generally, let's try to avoid manually interfering with the current working directory, so let's remove os.chdir in your notebook. Construct an absolute path where possible.
In your init_spark_session, when calling addPyFile, use an absolute path instead. self.project_path points to the root directory of your Kedro project, so you can use it to construct the path to your PyFile accordingly, e.g. _spark_session.sparkContext.addPyFile(f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg') (see the sketch below).
Not sure why you would need to add the PyFile though, but maybe you have a specific reason.
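A minimal sketch of point 2, assuming the same imports (SparkConf, SparkSession) and ProjectContext as in your context.py; project_name and the Python version in the egg filename are placeholders:
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in the project's conf folder."""
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())

    spark_session_conf = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
    )
    _spark_session = spark_session_conf.getOrCreate()
    _spark_session.sparkContext.setLogLevel("WARN")
    # Resolve the egg relative to the project root rather than the current
    # working directory, so it is found from notebooks as well.
    _spark_session.sparkContext.addPyFile(
        f"{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg"
    )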
From the documentation for from_pretrained, I understand I don't have to download the pretrained vectors every time; I can save them and load them from disk with this syntax:
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
So, I went to the model hub:
https://huggingface.co/models
I found the model I wanted:
https://huggingface.co/bert-base-cased
I downloaded it from the link they provided to this repository:
Pretrained model on English language using a masked language modeling
(MLM) objective. It was introduced in this paper and first released in
this repository. This model is case-sensitive: it makes a difference
between english and English.
Stored it in:
/my/local/models/cased_L-12_H-768_A-12/
Which contains:
./
../
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
So, now I have the following:
PATH = '/my/local/models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
And I get this error:
> raise EnvironmentError(msg)
E OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E
E - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E
E - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file
Similarly, when I link to the config.json directly:
PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
    if state_dict is None and not from_tf:
        try:
            state_dict = torch.load(resolved_archive_file, map_location="cpu")
        except Exception:
            raise OSError(
>               "Unable to load weights from pytorch checkpoint file. "
                "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
            )
E   OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
What should I do differently to get huggingface to use my local pretrained model?
Update to address the comments
YOURPATH = '/somewhere/on/disk/'
name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)
>>> Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')
So all is saved, but then....
YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)
"Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
Where is the file located relative to your model folder? I believe it has to be a relative PATH rather than an absolute one. So if the file where you are writing the code is located in 'my/local/', then your code should be like so:
PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
You just need to specify the folder where all the files are, not the files directly. I think this is definitely a problem with the PATH. Try changing the style of the "slashes" ("/" vs "\"), since these differ between operating systems. Also try prefixing with ".", like so: ./models/cased_L-12_H-768_A-12/ etc.
I had this same need and just got this working with TensorFlow on my Linux box, so I figured I'd share.
My requirements.txt file for my code environment:
tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3
I'm using Python 3.6.
I went to this site, which shows the directory tree for the specific huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for your cased version. Also note that my link is to a very specific commit of this model, just for the sake of reproducibility - there will very likely be a more up-to-date version by the time someone reads this.
I manually downloaded (or had to copy/paste into notepad++ because the download button took me to a raw version of the txt / json in some cases... odd...) the following files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
NOTE: Once again, all I'm using is Tensorflow, so I didn't download the Pytorch weights. If you're using Pytorch, you'll likely want to download those weights instead of the tf_model.h5 file.
I then put those files in this directory on my Linux box:
/opt/word_embeddings/bert-base-uncased/
Probably a good idea to make sure there's at least read permissions on all of these files as well with a quick ls -la (my permissions on each file are -rw-r--r--). I also have execute permissions on the parent directory (the one listed above) so people can cd to this dir.
From there, I'm able to load the model like so:
tokenizer:
# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
layer/model weights:
# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
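As a quick smoke test that the local files loaded correctly (a sketch; the sentence is arbitrary and the directory is the one I used above):
from transformers import BertTokenizer, TFAutoModel

# Loaded from the local directory, as above.
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")

# Tokenize a sentence and run it through the model.
inputs = tokenizer("Loading BERT from local files works.", return_tensors="tf")
outputs = bert(inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)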
This should be quite easy on Windows 10 using a relative path. Assuming your pre-trained (PyTorch-based) transformer model is in a 'model' folder in your current working directory, the following code can load your model.
from transformers import AutoModel
model = AutoModel.from_pretrained('.\model', local_files_only=True)
Please note the 'dot' in '.\model'. Omitting it will make the code fail.
In addition to the config file and vocab file, you need to add the tf/torch model (which has a .h5/.bin extension) to your directory.
In your case, the torch and tf models may be located at these URLs:
torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin
tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5
You can also find all the required files in the "Files and versions" section of your model: https://huggingface.co/bert-base-cased/tree/main
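For example, here is a sketch of assembling such a local folder with Python for the PyTorch route; the URLs follow the "Files and versions" page linked above and may change over time, and the target directory is a placeholder:
import os
import urllib.request

from transformers import BertModel, BertTokenizer

LOCAL_DIR = "/my/local/models/bert-base-cased/"  # placeholder path
os.makedirs(LOCAL_DIR, exist_ok=True)

# Files needed for the PyTorch model; URLs may move over time.
FILES = {
    "config.json": "https://huggingface.co/bert-base-cased/resolve/main/config.json",
    "vocab.txt": "https://huggingface.co/bert-base-cased/resolve/main/vocab.txt",
    "pytorch_model.bin": "https://huggingface.co/bert-base-cased/resolve/main/pytorch_model.bin",
}
for filename, url in FILES.items():
    urllib.request.urlretrieve(url, os.path.join(LOCAL_DIR, filename))

tokenizer = BertTokenizer.from_pretrained(LOCAL_DIR, local_files_only=True)
model = BertModel.from_pretrained(LOCAL_DIR, local_files_only=True)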
The bert model folder contains these files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
What if, instead of these, we have:
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
How do we load the model then?
Here is a short answer.
from transformers import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/pytorch_model.bin', config='../config.json', local_files_only=True)
Usually config.json does not need to be supplied explicitly if it resides in the same directory.
You can use the simpletransformers library. Check out the link for a more detailed explanation.
from simpletransformers.classification import ClassificationModel
model = ClassificationModel(
    "bert", "dir/your_path"
)
Here I used ClassificationModel as an example. You can use it for many other tasks as well, like question answering etc.
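And a quick usage sketch with the locally loaded model (the directory name and input text are placeholders):
from simpletransformers.classification import ClassificationModel

# "dir/your_path" is the local directory holding the model files, as above;
# set use_cuda=True if a GPU is available.
model = ClassificationModel("bert", "dir/your_path", use_cuda=False)

# predict() returns the class predictions and the raw model outputs.
predictions, raw_outputs = model.predict(["some text to classify"])
print(predictions)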
According to the documentation, the default download directory for all Keras files is $HOME/.keras. I'm using a virtual environment and I want to change the default download directory of pre-trained models to a different directory. Maybe this has more to do with virtualenv than with Keras?
If you are using the master branch of Keras, you can set the KERAS_HOME environment variable to set the cache directory. If it is not set, the cache directory defaults to $HOME/.keras.
export KERAS_HOME="/path/to/keras/dir"
Add the line to your ".bashrc" to set the variable every time you open up a new terminal.
This has not yet been released; you must use the master branch to use this feature.
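If you prefer to set it from Python rather than the shell, a minimal sketch; the path is a placeholder, and the variable should be set before Keras downloads anything:
import os

# Set the cache location before importing keras / triggering any downloads.
os.environ["KERAS_HOME"] = "/path/to/keras/dir"

from keras.applications.resnet50 import ResNet50

# Pre-trained weights will now be cached under $KERAS_HOME/models.
model = ResNet50(weights="imagenet")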
According to the documentation:
Signature: ResNet50(include_top=True, weights='imagenet',
input_tensor=None, input_shape=None, pooling=None, classes=1000)
There's no parameter to specify where to download the pre-trained model weights.
(1) What you can do is move the file to where you want it after the download, from your terminal, using mv (https://www.macworld.com/article/2080814/master-the-command-line-copying-and-moving-files.html).
UPDATE: I went to check the github repo of Keras (https://github.com/keras-team/keras/blob/master/keras/applications/resnet50.py) and found the link to the weights. For resnet:
WEIGHTS_PATH = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5'
WEIGHTS_PATH_NO_TOP = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'
You can download those weights directly to your file system using whatever method you like (e.g. urllib).
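For example, a sketch that downloads the weights to a directory of your choosing and then loads them directly; keras.applications accepts a filesystem path for the weights argument, and the target directory below is a placeholder:
import os
import urllib.request

from keras.applications.resnet50 import ResNet50

# Weights URL copied from the Keras source linked above.
WEIGHTS_URL = ("https://github.com/fchollet/deep-learning-models/releases/"
               "download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5")
TARGET_DIR = "/data/keras_weights"  # example location
os.makedirs(TARGET_DIR, exist_ok=True)

weights_path = os.path.join(TARGET_DIR, "resnet50_weights_tf_dim_ordering_tf_kernels.h5")
urllib.request.urlretrieve(WEIGHTS_URL, weights_path)

# Load the weights from the local file instead of the default cache.
model = ResNet50(weights=weights_path)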
You can also copy the model file (*.h5) that was downloaded by other means into the Keras default models directory, ~/.keras/models.
I have an ALS model that I train on one Spark installation. I persist it like so:
model.save(sc, './recommender_model')
On the filesystem it looks like this:
$ find ./recommender_model
./recommender_model
./recommender_model/metadata
./recommender_model/metadata/_SUCCESS
./recommender_model/metadata/._SUCCESS.crc
./recommender_model/metadata/.part-00000.crc
./recommender_model/metadata/part-00000
./recommender_model/data
./recommender_model/data/product
./recommender_model/data/product/part-r-00001-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet
./recommender_model/data/product/_common_metadata
./recommender_model/data/product/.part-r-00000-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet.crc
./recommender_model/data/product/.part-r-00001-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet.crc
./recommender_model/data/product/_SUCCESS
./recommender_model/data/product/._metadata.crc
./recommender_model/data/product/._SUCCESS.crc
./recommender_model/data/product/._common_metadata.crc
./recommender_model/data/product/part-r-00000-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet
./recommender_model/data/product/_metadata
./recommender_model/data/user
./recommender_model/data/user/_common_metadata
./recommender_model/data/user/.part-r-00001-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet.crc
./recommender_model/data/user/_SUCCESS
./recommender_model/data/user/.part-r-00000-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet.crc
./recommender_model/data/user/._metadata.crc
./recommender_model/data/user/part-r-00000-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet
./recommender_model/data/user/._SUCCESS.crc
./recommender_model/data/user/._common_metadata.crc
./recommender_model/data/user/part-r-00001-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet
./recommender_model/data/user/_metadata
I would like to move this folder to another Spark installation, so that I train on one installation and use another for predictions.
Can I simply tar up this folder and unpack it on the other Spark instance, where I can then load the model? E.g.
model = MatrixFactorizationModel.load(sc, './recommender_model')
my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = model.predictAll(my_movie)
individual_movie_rating_RDD.collect()
I performed some basic testing and the approach of zipping up the whole folder and unzipping it on another machine worked well for me.
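For example, a minimal sketch of the pack/unpack/load round trip (paths are examples, and sc is assumed to be an existing SparkContext on the target installation):
import tarfile

# On the source installation: pack the saved model directory.
with tarfile.open("recommender_model.tar.gz", "w:gz") as tar:
    tar.add("./recommender_model")

# ...copy the archive to the other machine (scp, rsync, etc.), then unpack:
with tarfile.open("recommender_model.tar.gz", "r:gz") as tar:
    tar.extractall(".")

# On the target installation: load and use the model as before.
from pyspark.mllib.recommendation import MatrixFactorizationModel

model = MatrixFactorizationModel.load(sc, "./recommender_model")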
I am trying to checkpoint the RDD to a non-HDFS system. From the DSE documentation it seems that it is not possible to use the Cassandra file system, so I am planning to use Amazon S3. But I am not able to find any good example of using AWS S3 for this.
Questions
How do I use Amazon S3 as the checkpoint directory? Is it enough to just call
ssc.checkpoint(amazons3url)?
Is it possible to use any reliable data storage other than the Hadoop file system for checkpointing?
From the answer in the link
Solution 1:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
ssc.checkpoint(checkpointDirectory)
Set the checkpoint directory to an S3 URL such as
s3n://spark-streaming/checkpoint
and then launch your Spark application using spark-submit.
This works in Spark 1.4.2.
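A minimal PySpark sketch of Solution 1 (bucket name, prefix, and batch interval are placeholders; the AWS keys are read from the environment variables exported above):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="s3-checkpoint-example")
ssc = StreamingContext(sc, batchDuration=10)

# Point the checkpoint directory at an S3 URL instead of HDFS.
ssc.checkpoint("s3n://spark-streaming/checkpoint")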
Solution 2:
val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")
StreamingContext.getOrCreate(checkPointDir, () => {
  createStreamingContext(checkPointDir, config)
}, hadoopConf)
To checkpoint to S3, you can pass the following notation to the StreamingContext method def checkpoint(directory: String): Unit:
s3n://<aws-access-key>:<aws-secret-key>@<s3-bucket>/<prefix ...>
Another reliable file system for checkpointing that is not listed in the Spark documentation is Tachyon (since renamed Alluxio).