Has anyone gotten SpaCy 2.0 to work in AWS Lambda? I have everything zipped and packaged correctly, since I can get a generic string to return from my lambda function if I test it. But when I do the simple function below to test, it stalls for about 10 seconds and then returns empty, and I don't get any error messages. I did set my Lambda timeout at 60 seconds so that isn't the problem.
import spacy
nlp = spacy.load('en_core_web_sm') #model package included

def lambda_handler(event, context):
    doc = nlp(u'They are')
    msg = doc[0].lemma_
    return msg
When I load the model package without using it, it also returns empty, but if I comment it out it sends me the string as expected, so it has to be something about loading the model.
import spacy
nlp = spacy.load('en_core_web_sm') #model package included

def lambda_handler(event, context):
    msg = 'message returned'
    return msg
To optimize model loading, store the model on S3, download it to Lambda's /tmp folder with your own script, and then load it into spaCy from there.
It takes about 5 seconds to download the model from S3 and run. The useful optimization here is to keep the model on the warm container and check whether it has already been downloaded; on a warm container the code takes about 0.8 seconds to run.
Here is a link to the code and a package with an example:
https://github.com/ryfeus/lambda-packs/blob/master/Spacy/source2.7/index.py
import spacy
import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='s3bucket'):
    # Recursively download an S3 "directory" (prefix) into the local folder.
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        if result.get('Contents') is not None:
            for file in result.get('Contents'):
                if not os.path.exists(os.path.dirname(local + os.sep + file.get('Key'))):
                    os.makedirs(os.path.dirname(local + os.sep + file.get('Key')))
                resource.meta.client.download_file(bucket, file.get('Key'), local + os.sep + file.get('Key'))

def handler(event, context):
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    # Download the model from S3 only on a cold start; warm containers reuse /tmp.
    if not os.path.isdir("/tmp/en_core_web_sm"):
        download_dir(client, resource, 'en_core_web_sm', '/tmp', 'ryfeus-spacy')
    spacy.util.set_data_path('/tmp')
    nlp = spacy.load('/tmp/en_core_web_sm/en_core_web_sm-2.0.0')
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
    for token in doc:
        print(token.text, token.pos_, token.dep_)
    return 'finished'
P.S. To fit spaCy within AWS Lambda's deployment package size limits, you have to strip the shared libraries before zipping.
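For illustration, one way to do that from Python as a build step (a sketch only; it assumes the strip binary is available in the build environment, and the package directory name is hypothetical):

import os
import subprocess

def strip_shared_libs(pkg_dir):
    # Walk the package directory and strip symbols from every shared library (.so file).
    for root, _, files in os.walk(pkg_dir):
        for name in files:
            if name.endswith('.so'):
                subprocess.call(['strip', os.path.join(root, name)])

# strip_shared_libs('build/')  # hypothetical build directory, run before zipping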
I knew it was probably going to be something simple. The answer is that there wasn't enough memory allocated to run the Lambda function; I found I had to increase it to at least 2816 MB, near the maximum, to get the example above to work. Notably, before last month it wasn't possible to go this high:
https://aws.amazon.com/about-aws/whats-new/2017/11/aws-lambda-doubles-maximum-memory-capacity-for-lambda-functions/
I turned it up to the max of 3008 MB to handle more text and everything seems to work just fine now.
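For reference, the memory setting can also be raised programmatically with boto3 (a sketch; the function name below is hypothetical):

import boto3

lambda_client = boto3.client('lambda')
lambda_client.update_function_configuration(
    FunctionName='my-spacy-function',  # hypothetical function name
    MemorySize=3008,                   # MB; more memory also means more CPU for the function
    Timeout=60,
)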
What worked for me was cd'ing into <YOUR_ENV>/lib/python<VERSION>/site-packages/ and removing the language models I didn't need. For example, I only needed the English language model, so once in my own site-packages directory I just needed to run `ls -d */ | grep -v en | xargs rm -rf`, and then zip up the contents to get it under Lambda's limits.
I'm using go pprof like this:
go tool pprof -no_browser -http=0.0.0.0:8081 http://localhost:6060/debug/pprof/profile?seconds=60
How can I ask pprof to fetch the profiling data periodically?
Here's a Python script that uses wget to grab the data every hour, putting the output into a file whose name includes the timestamp.
Each file can be inspected by running
go tool pprof pprof_data_YYYY-MM-DD_HH
Here's the script:
import subprocess
import time
from datetime import datetime
while True:
    now = datetime.now()
    sleepTime = 3601 - (60 * now.minute + now.second + 1e-6 * now.microsecond)
    time.sleep(sleepTime)
    now = datetime.now()
    tag = f"{now.year}-{now.month:02d}-{now.day:02d}_{now.hour:02d}"
    subprocess.run(["wget", "-O", f"pprof_data_{tag}", "-nv", "-o", "/dev/null",
                    "http://localhost:6060/debug/pprof/profile?seconds=60"])
The 3601 causes wget to run about 1 second after the top of the hour, to avoid the race condition where time.sleep returns just before the top of the hour.
You could obviously write a similar script in bash or your favorite language.
I am using pexpect to connect to a remote server using ssh.
The following code works, but I have to use time.sleep to add delays,
especially when I am sending a command to run a script on the remote server.
The script takes up to a minute to run, and if I don't use a 60-second delay, then the script will end prematurely.
The same issue occurs when I use sftp to download a file: if the file is large, it downloads only partially.
Is there a way to control this without using delays?
#!/usr/bin/python3
import pexpect
import time
from subprocess import call

siteip = "131.235.111.111"
ssh_new_conn = 'Are you sure you want to continue connecting'
password = 'xxxxx'

child = pexpect.spawn('ssh admin@' + siteip)
time.sleep(1)
child.expect('admin@.* password:')
child.sendline('xxxxx')
time.sleep(2)
child.expect('admin@.*')
print('ssh to abcd - takes 60 seconds')
child.sendline('backuplog\r')
time.sleep(50)
child.sendline('pwd')
Many pexpect functions take an optional timeout= keyword, and the one you give in spawn() sets the default, e.g.
child.expect('admin@.*', timeout=70)
You can pass the value None to never time out.
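For example, the sleeps in the question could be replaced by expect() calls with generous timeouts (a sketch; the prompt patterns and timeout values are assumptions, adjust them to your server):

import pexpect

child = pexpect.spawn('ssh admin@131.235.111.111', timeout=120)  # default timeout for every expect()
child.expect('password:')
child.sendline('xxxxx')
child.expect('admin@.*')                # wait for the shell prompt instead of sleeping
child.sendline('backuplog')
child.expect('admin@.*', timeout=90)    # give the long-running script up to 90 s to finish
child.sendline('pwd')
child.expect('admin@.*')
print(child.before.decode())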
I created the following Python 3.5 script:
import sys
from pathlib import Path

def test_unlink():
    log = Path('log.csv')
    fails = 0
    num_tries = int(sys.argv[1])
    for i in range(num_tries):
        try:
            log.write_text('asdfasdfasdfasdfasdfasdfasdfasdf')
            with log.open('r') as logfile:
                lines = logfile.readlines()
            # Check the second line to account for the log file header
            assert len(lines) == 1
            log.unlink()
            not log.exists()
        except PermissionError:
            sys.stdout.write('! ')
            sys.stdout.flush()
            fails += 1
    assert fails == 0, '{:%}'.format(fails / num_tries)

test_unlink()
and run it like this: python test.py 10000. On Windows 7 Pro 64-bit with Python 3.5.2, the failure rate is not 0: it is small, but non-zero. Sometimes it is not even that small: 5%! If you print out the exception, it will be this:
PermissionError: [WinError 5] Access is denied: 'C:\\...\\log.csv'
but it will sometimes occur at the exists(), other times at the write_text(), and I wouldn't be surprised if it happens at the unlink() and the open() too.
Note that the same script, with the same Python (3.5.2), but on Linux (through http://repl.it/), does not have this issue: the failure rate is 0.
I realize that a possible workaround could be:
while True:
    try: log.unlink()
    except PermissionError: pass
    else: break
but this is tedious and error-prone (several methods on the Path instance would need this, and it is easy to forget), and should (IMHO) not be necessary, so I don't think it is a practical solution.
So, does anyone have an explanation for this, and a practical workaround, maybe a mode flag somewhere that can be set when Python starts?
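For illustration, the retry workaround above could at least be factored into a small helper so it isn't repeated at every call site (a sketch only; it does not address the underlying Windows behaviour):

import time

def retry_on_permission_error(fn, attempts=50, delay=0.01):
    # Retry fn() briefly while Windows still reports the file as in use.
    for _ in range(attempts - 1):
        try:
            return fn()
        except PermissionError:
            time.sleep(delay)
    return fn()  # final attempt; let the exception propagate

# usage:
# retry_on_permission_error(log.unlink)
# retry_on_permission_error(lambda: log.write_text('asdf'))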
I would like to efficiently save/read numpy arrays from/to HDFS inside worker functions in PySpark. I have two machines, A and B. A has the master and a worker; B has one worker. For example, I would like to achieve something like the below:
from pyspark import SparkConf, SparkContext

def func(iterator):
    P = ...  # << LOAD from HDFS or shared memory as numpy array >>
    for x in iterator:
        P = P + x
    # << SAVE P (numpy array) to HDFS / shared file system >>

if __name__ == "__main__":
    conf = SparkConf().setMaster("local").setAppName("Test")
    sc = SparkContext(conf=conf)
    sc.parallelize([0, 1, 2, 3], 2).foreachPartition(func)
What can be a fast and efficient method for this?
I stumbled upon the same problem and finally solved it using a workaround with the HdfsCLI module and tempfiles on Python 3.4.
imports:
from hdfs import InsecureClient
from tempfile import TemporaryFile
Create an HDFS client. In most cases, it is better to have a utility function somewhere in your script, like this one:
def get_hdfs_client():
    return InsecureClient("<your webhdfs uri>", user="<hdfs user>",
                          root="<hdfs base path>")
Load and save your numpy array inside a worker function:
hdfs_client = get_hdfs_client()
# load from file.npy
path = "/whatever/hdfs/file.npy"
tf = TemporaryFile()
with hdfs_client.read(path) as reader:
    tf.write(reader.read())
tf.seek(0) # important, set cursor to beginning of file
np_array = numpy.load(tf)
...
# save to file.npy
tf = TemporaryFile()
numpy.save(tf, np_array)
tf.seek(0) # important ! set the cursor to the beginning of the file
# with overwrite=False, an exception is thrown if the file already exists
hdfs_client.write("/whatever/output/file.npy", tf.read(), overwrite=True)
Notes:
the URI used to create the HDFS client begins with http://, because it uses the web interface of the HDFS file system;
ensure that the user you pass to the HDFS client has read and write permissions;
in my experience, the overhead is not significant (at least in terms of execution time);
the advantage of using tempfiles (vs. regular files in /tmp) is that no garbage files are left on the cluster machines after the script ends, whether it finishes normally or not.
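Putting this together with the question's foreachPartition worker, a minimal sketch could look like this (the webhdfs URI and the HDFS paths are hypothetical; writing one output file per partition process avoids collisions):

import os
import numpy
from hdfs import InsecureClient
from tempfile import TemporaryFile

def func(iterator):
    hdfs_client = InsecureClient("http://namenode:50070", user="hdfs")  # hypothetical URI/user

    # load P from HDFS into a numpy array
    tf = TemporaryFile()
    with hdfs_client.read("/data/P.npy") as reader:
        tf.write(reader.read())
    tf.seek(0)
    P = numpy.load(tf)

    for x in iterator:
        P = P + x

    # save the result back to HDFS, one file per partition process
    out = TemporaryFile()
    numpy.save(out, P)
    out.seek(0)
    hdfs_client.write("/data/P_out_%d.npy" % os.getpid(), out.read(), overwrite=True)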
In the past, I have created an instance with attached EBS storage through the AWS web console. At the "Step4. Add storage" step I would add EBS storage as device="/dev/sdf", Standard as Volume type and no Snapshot. Once the instance got launched, I would issue the following set of commands to mount the extra drive as a separate directory and make it accessible to everybody:
sudo mkfs.ext4 /dev/xvdf
sudo mkdir /home/foo/extra_storage_directory
sudo mount -t ext4 /dev/xvdf /home/foo/extra_storage_directory
cd /home/foo
sudo chmod a+w extra_storage_directory
I was given a piece of Python code that creates instances without any extra storage programmatically. It calls boto.ec2.connection.run_instances. I need to modify this code to be able to create instances with extra storage. I need to essentially emulate the manual steps I used when doing it via the console, to make sure that the above sudo commands work after I launch the new instance.
Which boto function(s) do I need to use and how to add the storage?
UPDATE: I did some digging and wrote some code that I thought was supposed to do what I wanted. However, the behavior is a bit strange. Here's what I have:
res = state.connection.run_instances(state.ami, key_name=state.key, instance_type=instance_type, security_groups=sg)
inst = res.instances[0]
pmt = inst.placement
time.sleep(60)
try:
    vol = state.connection.create_volume(GB, pmt)
    tsleep = 60
    time.sleep(tsleep)
    while True:
        vstate = vol.status
        if not vstate == 'available':
            print "volume state is %s, trying again after %d secs" % (vstate, tsleep)
            time.sleep(tsleep)
        else:
            break
    print "Attaching vol %s to inst %s" % (str(vol.id), str(inst.id))
    state.connection.attach_volume(vol.id, inst.id, "/dev/sdf")
    print "attach_volume OK"
except Exception as e:
    print "Exception: %s" % str(e)
The call to run_instances came from the original code that I need to modify. After the volume gets created, when I look at its status in the AWS console, I see 'available'. However, I get an endless sequence of
volume state is creating, trying again after 60 secs
Why the difference?
As garnaat pointed out, I did have to use vol.update() to update the volume status. So the code below does what I need:
res = state.connection.run_instances(state.ami, key_name=state.key, instance_type=instance_type, security_groups=sg)
inst = res.instances[0]
pmt = inst.placement
time.sleep(60)
try:
    vol = state.connection.create_volume(GB, pmt)
    tsleep = 60
    time.sleep(tsleep)
    while True:
        vol.update()
        vstate = vol.status
        if not vstate == 'available':
            print "volume state is %s, trying again after %d secs" % (vstate, tsleep)
            time.sleep(tsleep)
        else:
            break
    print "Attaching vol %s to inst %s" % (str(vol.id), str(inst.id))
    state.connection.attach_volume(vol.id, inst.id, "/dev/sdf")
    print "attach_volume OK"
except Exception as e:
    print "Exception: %s" % str(e)
I tripped on the same problem and the answer at How to launch EC2 instance with Boto, specifying size of EBS? had the solution.
Here are the relevant links:
Python Boto documentation - block_device_map
API reference - BlockDeviceMapping.N
Command line reference - -b, --block-device-mapping mapping
CLI reference - --block-device-mappings (list)
Important note: while in the web console the "Delete on Termination" check box is checked by default, in the boto API it's the opposite: delete_on_termination=False by default!
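For reference, a minimal sketch of the boto (2.x) block device mapping approach with run_instances (the AMI ID, size, and other values below are made up):

import boto.ec2
from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType

conn = boto.ec2.connect_to_region('us-east-1')

# Describe the extra EBS volume that will appear as /dev/sdf (seen as /dev/xvdf in the instance).
dev_sdf = BlockDeviceType()
dev_sdf.size = 100                    # size in GiB
dev_sdf.delete_on_termination = True  # boto defaults to False, unlike the console

bdm = BlockDeviceMapping()
bdm['/dev/sdf'] = dev_sdf

reservation = conn.run_instances(
    'ami-12345678',                   # hypothetical AMI ID
    key_name='my-key',
    instance_type='m3.medium',
    security_groups=['default'],
    block_device_map=bdm,
)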