Using a distilled BERT model in a Celery/Redis task on Heroku

I am using a distilled BERT model to perform question answering:
https://huggingface.co/distilbert-base-cased-distilled-squad
Locally it all works fine, but when I try to use the model in my Celery task, the worker logs show:
2020-09-24T03:50:05.794556+00:00 app[worker.1]: 72%|███████▏ | 187M/261M [00:14<00:04, 15.5MB/s]
2020-09-24T03:50:19.779019+00:00 heroku[worker.1]: Process running mem=652M(123.2%)
2020-09-24T03:50:19.781651+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
The code below is where my imports, model loading, and task definition take place. Removing the imports and the model loading resolves the memory issue.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

@app.task
def evaluateQuestionAnswer():
    ...
I'm wondering if I need to use something like TensorFlow Serving:
https://www.tensorflow.org/tfx/guide/serving
Or am I missing something about making models production-ready? Would using Docker alleviate this loading problem?
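One thing worth trying before reaching for a dedicated serving stack is to load the model lazily inside the task, so the ~250 MB download and the tensors are only pulled into the worker process that actually runs the task, and to keep the Celery worker concurrency at 1 so only one copy ever lives in memory. This is a minimal sketch, not the poster's code; the task arguments and the answer-extraction logic are assumptions:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

_tokenizer = None
_model = None

def get_model():
    # Load once per worker process, on first use, instead of at import time.
    global _tokenizer, _model
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
        _model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")
    return _tokenizer, _model

@app.task
def evaluateQuestionAnswer(question, context):
    tokenizer, model = get_model()
    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start = int(torch.argmax(outputs.start_logits))
    end = int(torch.argmax(outputs.end_logits))
    return tokenizer.decode(inputs["input_ids"][0][start:end + 1])

If the loaded model alone pushes a single worker past the 512 MB dyno quota, lazy loading will not be enough on its own; a larger dyno, a quantized export, or an external inference service would be the next step.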

Related

Speed up keras load model

I am generating some neural networks using Keras. After a network is trained, I simply save it as:
model.save('model.h5')
Then, I load it as:
from keras.models import load_model
model = load_model("model.h5")
Everything works fine; however, the loading takes up to 15 seconds. Given that I am trying to load and use the model in an AWS Lambda function where there is a 30-second timeout, this is an issue. Is there any way of speeding up the loading process?
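Two things usually help here, and neither is specific to this model: load the model once at module level so warm Lambda invocations reuse it instead of paying the 15 seconds again, and pass compile=False to load_model, since the optimizer state is not needed for inference. A rough sketch (the handler name and event shape are assumptions):

import numpy as np
from keras.models import load_model

# Loaded once per Lambda container; only cold starts pay the full load time.
# compile=False skips restoring the optimizer state, which inference does not need.
model = load_model("model.h5", compile=False)

def handler(event, context):
    # "features" is a hypothetical event key; adapt it to your payload.
    x = np.array(event["features"], dtype="float32").reshape(1, -1)
    prediction = model.predict(x)
    return {"prediction": prediction.tolist()}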

How to speed up basic pyspark statements

As a new Spark/PySpark user, I have a script running on an AWS t2.small EC2 instance in local mode (for testing purposes only). As an example:
from __future__ import print_function
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
import ritc  # my library

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("NaiveBayesExample")\
        .getOrCreate()
    ...
    request_dataframe = spark.createDataFrame(ritc.request_parameters, ["features"])
    model = NaiveBayesModel.load(ritc.model_path)
    ...
    prediction = model.transform(ritc.request_dataframe)
    prediction.createOrReplaceTempView("result")
    df = spark.sql("SELECT prediction FROM result")
    p = map(lambda row: row.asDict(), df.collect())
    ...
I have left out code so as to focus on my question, which relates to the speed of basic Spark statements such as spark = SparkSession...
Using the datetime library (not shown above), I have timings for the three biggest culprits:
'spark = SparkSession...' -- 3.7 secs
'spark.createDataFrame()' -- 2.6 secs
'NaiveBayesModel.load()' -- 3.4 secs
Why are these times so long?
To give a little background, I would like to provide the capability to expose scripts such as the above as REST services.
In supervised context:
- service #1: train a model and save the model in the filesystem
- service #2: load the model from the filesystem and get a prediction for a single instance
(Note: the #2 REST requests would run at different, unanticipated (random) times. The general pattern would be:
-> once: train the model - expecting a long turnaround time
-> multiple times: request a prediction for a single instance - expecting a turnaround time in milliseconds, e.g. < 400 ms.)
Is there a flaw in my thinking? Can I expect to increase performance dramatically to achieve this goal of sub-second turnaround time?
In almost every article/video/discussion on Spark performance that I have come across, the emphasis has been on 'heavy' tasks. The 'train model' task above may indeed be a 'heavy' one - I expect this will be the case when run in production. But the 'request a prediction for a single instance' task needs to be responsive.
Can anyone help?
Thanks in anticipation.
Colin Goldberg
Apache Spark is not really designed to be used in this way. You might want to look at Spark Streaming if your goal is to handle streaming input data for predictions. You may also want to look at other options for serving Spark models, such as PMML or MLeap.
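If you do stay on plain Spark for serving, the practical pattern is to pay the SparkSession creation and NaiveBayesModel.load() cost once at service startup and reuse both across requests, so only the per-request transform remains on the hot path. A minimal sketch; the Flask framework, the route, the model path, and the payload shape are my assumptions, not from the original post:

from flask import Flask, jsonify, request
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

app = Flask(__name__)

# Created once at startup: the 3-4 second costs are paid here, not per request.
spark = SparkSession.builder.appName("NaiveBayesService").getOrCreate()
model = NaiveBayesModel.load("/path/to/model")  # hypothetical path

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [0.1, 0.2, ...]}; the shape is an assumption.
    features = request.json["features"]
    df = spark.createDataFrame([(Vectors.dense(features),)], ["features"])
    row = model.transform(df).select("prediction").first()
    return jsonify({"prediction": row["prediction"]})

Even with a warm session, each request still pays Spark's job-scheduling overhead for createDataFrame and transform, which is why exports such as MLeap or PMML are usually recommended when sub-100 ms latencies matter.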

How can I handle TensorFlow sessions to train multiple Keras models at the same time?

I need to train multiple Keras models at the same time. I'm using the TensorFlow backend. The problem is that when I try to train, say, two models at the same time, I get "Attempting to use uninitialized value".
The error is not really relevant; the main problem seems to be that Keras forces the two models to be created in the same session with the same graph, so they conflict.
I am a newbie in TensorFlow, but my gut feeling is that the answer is pretty straightforward: you would have to create a different session for each Keras model and train them in their own sessions. Could someone explain to me how that would be done?
I really hope it is possible to solve this problem while still using Keras and not coding everything in pure TensorFlow. Any workaround would be appreciated too.
You are right, Keras automatically works with the default session.
You could use tf.compat.v1.keras.backend.get_session() or tf.compat.v1.keras.backend.set_session(sess) to manually set the global Keras session (see documentation).
For instance:
import tensorflow as tf

sess1 = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess1)
# Train your first Keras model here ...

sess2 = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess2)
# Train your second Keras model here ...
I train multiple models in parallel by using Python's multiprocessing module: https://docs.python.org/3.4/library/multiprocessing.html
I have a function that takes two parameters, an input queue and an output queue; this function runs in each process. The function has the following structure:
def worker(in_queue, out_queue):
    import keras
    while True:
        parameters = in_queue.get()
        network_parameters = parameters[0]
        train_inputs = parameters[1]
        train_outputs = parameters[2]
        test_inputs = parameters[3]
        test_outputs = parameters[4]
        # build the network based on the given parameters
        # train the network
        # test the network if required
        out_queue.put(result)
From the main Python script, start as many processes (and create as many in and out queues) as required. Add jobs to a worker by calling put on its in queue, and get the results by calling get on its out queue. A sketch of that driver side follows.
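A minimal sketch of that driver side, assuming the worker() function above lives in a hypothetical module named training_workers and that jobs is your own list of parameter/data tuples packed in the same order worker() unpacks them:

import multiprocessing as mp

from training_workers import worker  # hypothetical module containing worker() above

if __name__ == "__main__":
    in_queue = mp.Queue()
    out_queue = mp.Queue()

    # One process per model to train concurrently, each with its own Keras/TF state.
    processes = [mp.Process(target=worker, args=(in_queue, out_queue), daemon=True)
                 for _ in range(2)]
    for p in processes:
        p.start()

    # Each job tuple matches the unpacking in worker():
    # (network_parameters, train_inputs, train_outputs, test_inputs, test_outputs)
    for job in jobs:  # 'jobs' is a placeholder for your own list of job tuples
        in_queue.put(job)

    # Collect one result per submitted job; results arrive in completion order.
    results = [out_queue.get() for _ in range(len(jobs))]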

python 3 requests_futures requests to same server in different processes

I am looking into parallelizing URL requests to one single web server in Python for the first time.
I would like to use requests_futures for this task, as it seems that one can really split the work across several cores with the ProcessPoolExecutor.
The example code from the module documentation is:
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))
future_one = session.get('http://httpbin.org/get')
future_two = session.get('http://httpbin.org/get?foo=bar')
response_one = future_one.result()
print('response one status: {0}'.format(response_one.status_code))
print(response_one.content)
response_two = future_two.result()
print('response two status: {0}'.format(response_two.status_code))
print(response_two.content)
The above code works for me; however, I need some help customizing it to my needs.
I want to query the same server, let's say, 50 times (e.g. 50 different httpbin.org/get?... requests). What would be a good way to split these up onto different futures, other than just defining future_one, ..._two and so on?
I am thinking about using different processes. According to the module documentation, it should be just a change in the first three lines of the above code:
from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ProcessPoolExecutor(max_workers=2))
If I execute this I get the following error:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
How do I get this running properly?
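For the 50-request case, the usual pattern is to build the futures in a loop rather than naming them one by one. As for the BrokenProcessPool error, one common cause is creating the ProcessPoolExecutor at module import time without an if __name__ == '__main__' guard, which matters on platforms where worker processes re-import the module. A sketch of both ideas (the query parameters are made up, and the guard may or may not be the cause in your environment):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from requests_futures.sessions import FuturesSession

def run(session):
    # Build all 50 futures up front, then collect the results.
    futures = [session.get('http://httpbin.org/get', params={'foo': i}) for i in range(50)]
    for future in futures:
        response = future.result()
        print(response.status_code, response.url)

if __name__ == '__main__':
    # Thread-based version: simplest, and usually enough for I/O-bound requests.
    run(FuturesSession(executor=ThreadPoolExecutor(max_workers=8)))

    # Process-based version: keep the pool creation under the __main__ guard.
    run(FuturesSession(executor=ProcessPoolExecutor(max_workers=2)))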

Django 1.3 multiple gunicorn workers caching problems

I have weird caching problems with the 1.3 version of Django. I probably have something configured wrong, but I am not sure what.
A good example is django-avatar, which uses caching and which many people use. Even if I don't have a cache backend defined, the avatar seems to be cached; that by itself would be OK, but it keeps switching back and forth between the last values cached. Example: I upload a new avatar; now on approximately 50% of the requests it shows me the new one, and 50% the old one. If I delete the old one, I still get it on the site 50% of the time. The only way to fix it is to disable the caching of the avatar by setting its timeout to one second.
First I thought it was because I used django.core.cache.backends.locmem.LocMemCache, which I had never used before, but it even happens when I don't configure a cache backend at all.
I found one similar bug:
Django caching bug .. even if caching is disabled
but my pages render just fine; it's the template tags (for now) that cause the problems in my setup.
I use django 1.3, postgres, nginx, gunicorn 0.12.0, greenlet==0.3.1, eventlet==0.9.16
I just did some more testing and realized that it only happens when I start gunicorn using the config file. If I start it with ./manage.py run_gunicorn everything is fine. Running "gunicorn_django -c deploy/gunicorn.conf.py" causes the problems.
The only explanation I can think of is that each worker gets its own cache (I wonder why, since I did not define a cache).
Update: running ./manage.py run_gunicorn -w 4 also causes the same problems. Therefore I am almost certain that the multiple workers are causing the problems and each worker caches the values separately.
My configuration:
import os
import socket
import sys
PORT = 8000
PROC_NAME = 'myapp_gunicorn'
LOGFILE_NAME = 'gunicorn.log'
TIMEOUT = 3600
IP = '127.0.0.1'
DEPLOYMENT_ROOT = os.path.dirname(os.path.abspath(__file__))
SITE_ROOT = os.path.abspath(os.path.sep.join([DEPLOYMENT_ROOT, '..']))
CPU_CORES = os.sysconf("SC_NPROCESSORS_ONLN")
sys.path.insert(0, os.path.join(SITE_ROOT, "apps"))
bind = '%s:%s' % (IP, PORT)
logfile = os.path.sep.join([DEPLOYMENT_ROOT, 'logs', LOGFILE_NAME])
proc_name = PROC_NAME
timeout = TIMEOUT
worker_class = 'eventlet'
workers = 2 * CPU_CORES + 1
I also tried it without using 'eventlet', but got the same errors.
Thanks for any help.
It is most likely defaulting to the in-memory cache, which means each worker has its own version of the cache in its own memory space. If you hit worker 1 you get a different cache than from worker 3. Nginx is most likely spreading the load between the workers via round-robin distribution, so you change workers on each hit, which explains your wacky results.
When you do manage.py run_gunicorn it is most likely running a single worker, and thus only one cache, and that is why you don't see the same results.
Using memcached or something similar is the way to go.
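For example, in Django 1.3 a shared memcached backend configured in settings.py gives every gunicorn worker the same cache (this assumes a memcached server running on the default port and the python-memcached binding installed):

# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

With a shared backend, an avatar cached by one worker is seen (and invalidated) by all of them, which removes the back-and-forth behaviour.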
