I am looking into parallelization of url requests onto one single webserver in python for the first time.
I would like to use requests_futures for this task as it seems that one can really split up processes onto several cores with the ProcessPoolExecutor.
The example code from the module documentation is:
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))
future_one = session.get('http://httpbin.org/get')
future_two = session.get('http://httpbin.org/get?foo=bar')
response_one = future_one.result()
print('response one status: {0}'.format(response_one.status_code))
print(response_one.content)
response_two = future_two.result()
print('response two status: {0}'.format(response_two.status_code))
print(response_two.content)
The above code works for me, however, I need some help with getting it customized to my needs.
I want to query the same server, let's say, 50 times (e.g. 50 different httpbin.org/get?... requests). What would be a good way to split these up onto different futures other than just defining future_one, ..._two and so on?
I am thinking about using different processes. According to the module documentation, it should be just a change in the first three lines of the above code:
from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ProcessPoolExecutor(max_workers=2))
If I execute this I get the following error:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
How do I get this running properly?
Related
A simple dask cache example. Cache does not work as expected. Let's assume we have a list of data and a series of delayed functions, expected that for a function that encounters the same input to cache/memoize the results according to cachey score.
This example demonstrates that is not the case.
import time
import dask
from dask.cache import Cache
from dask.diagnostics import visualize
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler
def slow_func(x):
time.sleep(5)
return x+1
output = []
data = np.ones((100))
for x in data:
a = dask.delayed(slow_func)(x)
output.append(a)
total = dask.delayed(sum)(output)
cache = Cache(2e9)
cache.register()
with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof,CacheProfiler() as cprof:
total.compute()
visualize([prof, rprof, cprof])
cache cprof plot
After the initial parallel execution of the function would expect the next iteration upon calling the function with the same value to use a cache version. But obviously does not, dask_key_name is for designating the same output, but i want to assess this function for a variety of inputs and if seeing the same input use cached version. We can tell if this is happening very easily with this function due to the 5 second delay and should see it execute roughly 5 seconds as soon as the first value is cached after execution. This example executes every single function delayed 5 seconds. I am able to create a memoized version using the cachey library directly but this should work using the dask.cache library.
In dask.delayed you may need to specify the pure=True keyword.
You can verify that this worked because all of your dask delayed values will have the same key.
You don't need to use Cache for this if they are all in the same dask.compute call.
I have a pandas DataFrame with about 45,000 rows similar to:
from numpy import random
from pandas import DataFrame
df = DataFrame(random.rand(45000, 200))
I am trying to break up all the rows into a multiprocessing Queue like this:
from multiprocessing import Queue
rows = [idx_and_row[1] for idx_and_row in df.iterrows()]
my_queue = Queue(maxsize = 0)
for idx, r in enumerate(rows):
# print(idx)
my_queue.put(r)
But when I run it, only about 37,000 things get put into my_queue and then it the program raises the following error:
raise Full
queue.Full
What is happening and how can I fix it?
The multiprocessing.Queue is designed for inter-process communication. It is not intended for storing large amount of data. For that purpose, I'd suggest to use Redis or Memcached.
Usually, the queue maximum size is platform dependent, even if you set it to 0. You have no easy way to workaround that.
It seems that on windows, the maximum amount of objects in a multiprocessing.Queue is infinite, but on Linux and MacOS the maximum size is 32767, which is 215 - 1, here is the significance of that number.
I solved the program by making an empty Queue object and then passing it to all the processes I wanted to pass it to, plus another process. The additional process is responsible for filling the queue with 10,000 rows, and checking it every few seconds to see if the queue has been emptied. When its empty, another 10,000 rows are added. This way, all 45,000 row is processed.
As a new spark/pyspark user, I have a script running on an AWS t2.small ec2 instance in local mode (for testing purposes ony).
ie. As an example:
from __future__ import print_function
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
import ritc (my library)
if __name__ == "__main__":
spark = SparkSession\
.builder\
.appName("NaiveBayesExample")\
.getOrCreate()
...
request_dataframe = spark.createDataFrame(ritc.request_parameters, ["features"])
model = NaiveBayesModel.load(ritc.model_path)
...
prediction = model.transform(ritc.request_dataframe)
prediction.createOrReplaceTempView("result")
df = spark.sql("SELECT prediction FROM result")
p = map(lambda row: row.asDict(), df.collect())
...
I have left out code so as to focus on my question, relating to the speed of basic spark statements such as spark = SparkSession...
Using the datetime library (not shown above), I have timings for the three biggest 'culprits':
'spark = SparkSession...' -- 3.7 secs
'spark.createDataFrame()' -- 2.6 secs
'NaiveBayesModel.load()' -- 3.4 secs
Why are these times so long??
To give a little background, I would like to provide the capability to expose scripts such as the above as REST services.
In supervised context:
- service #1: train a model and save the model in the filesystem
- service #2: load the model from the filesystem and get a prediction for a single instance
(Note: The #2 REST requests would run at different, and unanticipated (random) times. The general pattern would be:
-> once: train the model - expecting a long turnaround time
-> multiple times: request a prediction for a single instance - expecting a turnaround time in milliseconds - eg. < 400 ms.
Is there a flaw in my thinking? Can I expect to increase performance dramatically to achieve this goal of sub-second turnaround time?
In most every article/video/discussion on spark performance that I have come across, the emphasis has been on 'heavy' tasks. The 'train model' task above may indeed be a 'heavy' one - I expect this will be the case when run in production. But the 'request a prediction for a single instance' needs to be responsive.
Can anyone help?
Thanks in anticipation.
Colin Goldberg
So ApacheSpark is designed to be used in this way. You might want to look at Spark Streaming if your goal is to handle streaming input data for predictions. You may also want to look at other options for serving Spark models, like PMML or MLeap.
I have recently started working with pythons multiprocessing library, and decided that using the Pool() and apply_async() approach is the most suitable for my problem. The code is quite long, but for this question I've compressed everything that isn't related to the multiprocessing in functions.
Background information
Basically, my program is supposed to take some data structure and send it to another program that will process it and write the results to a txt file. I have several thousands of these structures (N*M), and there are big chunks (M) that are independent and can be processed in any order. I created a worker pool to process these M structures before retrieving the next chunk. In order to process one structure, a new thread has to be created for the external program to run. The time spend outside the external program during the processing is less than 20 %, so if I check Task Manager, I can see the external program running under processes.
Actual problem
This works very well for a while, but after many processed structures (any number between 5000 and 20000) suddenly the external program stop showing up in the Task Manager and the python children runs at their individual peak performance (~13% cpu) without producing any more results. I don't understand what the problem might be. There are plenty of RAM left, and each child only use around 90 Mb. Also it is really weird that it works for quite some time and then stops. If I use ctrl-c, it stops after a few minutes, so it is semi-irresponsive to user input.
One thought I had was that when the timed-out external program thread is killed (which happens every now and then), maybe something isn't closed properly so that the child process is waiting for something it cannot find anymore? And if so, is there any better way of handling timed-out external processes?
from multiprocessing import Pool, TimeoutError
N = 500 # Number of chunks of data that can be multiprocessed
M = 80 # Independed chunk of data
timeout = 100 # Higher than any of the value for dataStructures.timeout
if __name__ == "__main__":
results = [None]*M
savedData = []
with Pool(processes=4) as pool:
for iteration in range(N):
dataStructures = [generate_data_structure(i) for i in range(M)]
#---Process data structures---
for iS, dataStructure in enumerate(dataStructures):
results[iS] = pool.apply_async(processing_func,(dataStructure,))
#---Extract processed data---
for iR, result in enumerate(results):
try:
processedData = result.get(timeout=timeout)
except TimeoutError:
print("Got TimeoutError.")
if processedData.someBool:
savedData.append(processedData)
Here is also the functions that create the new thread for the external program.
import subprocess as sp
import win32api as wa
import threading
def processing_func(dataStructure):
# Call another program that processes the data, and wait until it is finished/timed out
timedOut = RunCmd(dataStructure.command).start_process(dataStructure.timeout)
# Read the data from the other program, stored in a text file
if not timedOut:
processedData = extract_data_from_finished_thread()
else:
processedData = 0.
return processedData
class RunCmd(threading.Thread):
CREATE_NO_WINDOW = 0x08000000
def __init__(self, cmd):
threading.Thread.__init__(self)
self.cmd = cmd
self.p = None
def run(self):
self.p = sp.Popen(self.cmd, creationflags=self.CREATE_NO_WINDOW)
self.p.wait()
def start_process(self, timeout):
self.start()
self.join(timeout)
timedOut = self.is_alive()
# Kills the thread if timeout limit is reached
if timedOut:
wa.TerminateProcess(self.p._handle,-1)
self.join()
return timedOut
i have weird caching problems with the 1.3 version of django. I probably have something configured wrong, but am not sure what.
A good example is django-avatar, which uses caching and many people use it. Even if I dont have a cache backend defined the avatar seems to be cached, which by itself would be ok, but it keeps switching back and forth between the last values cached. Example: I upload a new avatar, now on approximately 50% of the requests it will show me the new one, 50% the old one. If I delete the old one I still get it on the site 50% of the time. The only way to fix it is to disable the caching of the avatar by setting it to one second.
First I thought it was because i used django.core.cache.backends.locmem.LocMemCache, which I never used before, but it even happens when I dont configure a cache backend at all.
I found one similar bug:
Django caching bug .. even if caching is disabled
but my pages render just fine, its the templatetags (for now) that cause the problems in my setup.
I use django 1.3, postgres, nginx, gunicorn 0.12.0, greenlet==0.3.1, eventlet==0.9.16
I just did some more testing and realized that it only happens when I start gunicorn using the config file. If I start it with ./manage.py run_gunicorn everything is fine. Running "gunicorn_django -c deploy/gunicorn.conf.py" causes the problems.
The only explanation I can think of is that each worker gets his own cache (I wonder why, since I did not define a cache).
Update: running ./manage.py run_gunicorn -w 4 also causes the same problems. Therefore I am almost certain that the multiple workers are causing the problems and each worker caches the values seperately.
My configuration:
import os
import socket
import sys
PORT = 8000
PROC_NAME = 'myapp_gunicorn'
LOGFILE_NAME = 'gunicorn.log'
TIMEOUT = 3600
IP = '127.0.0.1'
DEPLOYMENT_ROOT = os.path.dirname(os.path.abspath(__file__))
SITE_ROOT = os.path.abspath(os.path.sep.join([DEPLOYMENT_ROOT, '..']))
CPU_CORES = os.sysconf("SC_NPROCESSORS_ONLN")
sys.path.insert(0, os.path.join(SITE_ROOT, "apps"))
bind = '%s:%s' % (IP, PORT)
logfile = os.path.sep.join([DEPLOYMENT_ROOT, 'logs', LOGFILE_NAME])
proc_name = PROC_NAME
timeout = TIMEOUT
worker_class = 'eventlet'
workers = 2 * CPU_CORES + 1
I also tried it without using 'eventlet', but got the same errors.
Thanks for any help.
It is most likely defaulting to the in-memory-cache, which means each worker has it's own version of the cache in it's own memory space. If you hit thread 1 you get a different cache then thread 3. Nginx is spreading the load between each thread most likely via a round robin distribution, so you are changing threads each hit. Which explains your wacky results.
When you do manage.py run_gunicorn it is most likely running single threaded, and thus only one cache, and that is why you don't see the same results.
Using memcached or something similar is the way to go.