dask cache delayed function example - caching

A simple dask cache example. Cache does not work as expected. Let's assume we have a list of data and a series of delayed functions, expected that for a function that encounters the same input to cache/memoize the results according to cachey score.
This example demonstrates that is not the case.
import time
import dask
from dask.cache import Cache
from dask.diagnostics import visualize
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler
def slow_func(x):
time.sleep(5)
return x+1
output = []
data = np.ones((100))
for x in data:
a = dask.delayed(slow_func)(x)
output.append(a)
total = dask.delayed(sum)(output)
cache = Cache(2e9)
cache.register()
with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof,CacheProfiler() as cprof:
total.compute()
visualize([prof, rprof, cprof])
cache cprof plot
After the initial parallel execution of the function would expect the next iteration upon calling the function with the same value to use a cache version. But obviously does not, dask_key_name is for designating the same output, but i want to assess this function for a variety of inputs and if seeing the same input use cached version. We can tell if this is happening very easily with this function due to the 5 second delay and should see it execute roughly 5 seconds as soon as the first value is cached after execution. This example executes every single function delayed 5 seconds. I am able to create a memoized version using the cachey library directly but this should work using the dask.cache library.

In dask.delayed you may need to specify the pure=True keyword.
You can verify that this worked because all of your dask delayed values will have the same key.
You don't need to use Cache for this if they are all in the same dask.compute call.

Related

MacOS: Why does Multiprocessing Queue.put stop working?

I have a pandas DataFrame with about 45,000 rows similar to:
from numpy import random
from pandas import DataFrame
df = DataFrame(random.rand(45000, 200))
I am trying to break up all the rows into a multiprocessing Queue like this:
from multiprocessing import Queue
rows = [idx_and_row[1] for idx_and_row in df.iterrows()]
my_queue = Queue(maxsize = 0)
for idx, r in enumerate(rows):
# print(idx)
my_queue.put(r)
But when I run it, only about 37,000 things get put into my_queue and then it the program raises the following error:
raise Full
queue.Full
What is happening and how can I fix it?
The multiprocessing.Queue is designed for inter-process communication. It is not intended for storing large amount of data. For that purpose, I'd suggest to use Redis or Memcached.
Usually, the queue maximum size is platform dependent, even if you set it to 0. You have no easy way to workaround that.
It seems that on windows, the maximum amount of objects in a multiprocessing.Queue is infinite, but on Linux and MacOS the maximum size is 32767, which is 215 - 1, here is the significance of that number.
I solved the program by making an empty Queue object and then passing it to all the processes I wanted to pass it to, plus another process. The additional process is responsible for filling the queue with 10,000 rows, and checking it every few seconds to see if the queue has been emptied. When its empty, another 10,000 rows are added. This way, all 45,000 row is processed.

How can I handle TensorFlow sessions to train multiple Keras models at the same time?

I need to train multiple Keras models at the same time. I'm using TensorFlow backend. Problem is, when I try to train, say, two models at the same time, I get Attempting to use uninitialized value.
The error is not really relevant, the main problem seems to be that Keras is forcing the two models to be created in the same session with the same graph so it conflicts.
I am a newbie in TensorFlow but my gut feeling is that the answer is pretty straightforward : you would have to create a different session for each Keras model and train them in their own session. Could someone explain me how it would be done ?
I really hope it is possible to solve this problem while still using Keras and not coding everything in pure TensorFlow. Any workaround would be appreciated too.
You are right, Keras automatically works with the default session.
You could use tf.compat.v1.keras.backend.get_session() or tf.compat.v1.keras.backend.set_session(sess) to manually set the global Keras session (see documentation).
For instance:
sess1 = tf.Session()
tf.compat.v1.keras.backend.set_session(sess1)
# Train your first Keras model here ...
sess2 = tf.Session()
tf.compat.v1.keras.backend.set_session(sess2)
# Train your second Keras model here ...
I train multiple models in parallel by using pythons multiprocessing, https://docs.python.org/3.4/library/multiprocessing.html.
I have a function that takes two parameters, an input queue and an output queue, this function runs in each process. The function has the following structure:
def worker(in_queue, out_queue):
import keras
while True:
parameters = in_queue.get()
network_parameters = parameters[0]
train_inputs = parameters[1]
train_outputs = parameters[2]
test_inputs = parameters[3]
test_outputs = parameters[4]
build the network based on the given parameters
train the network
test the network if required
out_queue.put(result)
From the main python script start as many processes (and create as many in and out queues) as required. Add jobs to a worker by calling put on its in queue and get the results by calling get on its out queue.

python 3 requests_futures requests to same server in different processes

I am looking into parallelization of url requests onto one single webserver in python for the first time.
I would like to use requests_futures for this task as it seems that one can really split up processes onto several cores with the ProcessPoolExecutor.
The example code from the module documentation is:
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))
future_one = session.get('http://httpbin.org/get')
future_two = session.get('http://httpbin.org/get?foo=bar')
response_one = future_one.result()
print('response one status: {0}'.format(response_one.status_code))
print(response_one.content)
response_two = future_two.result()
print('response two status: {0}'.format(response_two.status_code))
print(response_two.content)
The above code works for me, however, I need some help with getting it customized to my needs.
I want to query the same server, let's say, 50 times (e.g. 50 different httpbin.org/get?... requests). What would be a good way to split these up onto different futures other than just defining future_one, ..._two and so on?
I am thinking about using different processes. According to the module documentation, it should be just a change in the first three lines of the above code:
from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ProcessPoolExecutor(max_workers=2))
If I execute this I get the following error:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
How do I get this running properly?

While Recording in Squish using python, how to set the application to sleep for sometime between 2 consecutive activities?

I am working on an application where there are read only screens.
To test whether the data is being fetched on screen load, i want to set some wait time till the screen is ready.
I am using python to record the actions. Is there a way to check the static text on the screen and set the time ?
You can simply use
snooze(time in s).
Example:
snooze(5)
If you want to wait for a certain object, use
waitForObject(":symbolic_name")
Example:
type(waitForObject(":Welcome.Button"), )
The problem is more complicated if your objects are created dynamically. As my app does. In this case, you should create a while function that waits until the object exists. Here, maybe this code helps you:
def whileObjectIsFalse(objectID):
# objectID = be the symbolic name of your object.
counter = 300
objectState = object.exists(objectID)
while objectState == False:
objectState = object.exists(objectID)
snooze(0.1)
counter -= 1
if counter == 0:
return False
snooze(0.2)
In my case, even if I use snooze(), doesn't work all the time, because in some cases i need to wait 5 seconds, in other 8 or just 2. So, presume that your object is not created and tests this for 30 seconds.
If your object is not created until then, then the code exits with False, and you can tests this to stop script execution.
If you're using python, you can use time.sleep() as well

python slow to check if mongodb record found

I have a python (3.2) request that goes to MongoDB and the request itself is running fast enough. When I then perform an if statement check to see if any records were found it takes 50 times as long:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
58 27623 6475988 234.4 1.7 itemInDB = db.mainData.find({"x":item[x]}).limit(1)
59
60 #existing item in db
61 27623 293419802 10622.3 77.6 if itemInDB.count():
What on earth is the cause for that if statement taking so long?! I presume there must be a better way to check if a record was found but google has come up empty.
Thanks for the help.
Perhaps a Better Way
If you're only interested in returning one value, you might want to use find_one instead of find. It will stop looking for values after one has been found, as opposed to find, which has to run through the collection:
itemInDB = db.mainData.find_one({"x":item[x]})
if itemInDB:
print("Item found")
else:
print("Item not found")
For Your Example
According to the PyMongo docs, when querying the count of a cursor, you can pass in a parameter (True or False) to take into account any skip or limit calls previously made to the cursor. The default for that parameter is False (namely, not taking those calls into account). That may be affecting the performance of your count query.
Gauging Query Performance
If you want to see how your query will be carried out by mongo, you can call explain on your cursor:
db.coll.find({"x":4}).explain()
The explain function is also implemented in PyMongo.
Turns out it was due to the find() function and not the if statement. I created an index on "x" (as I should have anyway). Changed the find to find_one and removed the .count() from the if statement. Overall 75% faster.

Resources