local variable access by workers in julia - parallel-processing

Let's say I have some data::Vector{Float64} and a function f!(data::Vector{Float64}, i::Int) that calculates some value from it, modifying it in the process.
answers = pmap([1, 2, 3, 4]) do i
    f!(data, i)
end
Is this safe to do? Does each worker have its own copy of data, or should I explicitly copy(data) on all workers?

That is safe to do: the do-block creates a closure that captures data, and the closure is serialized together with the captured data and sent to each worker process. Each worker therefore gets its own copy, so mutations made by f! stay local to that worker, and there is no need to explicitly copy(data). If you use a CachingPool, the closure (and thus data) is sent to each worker only once rather than with every task (in Julia v0.7 and 1.0 this will be done by default).

Related

Stop specific process in Python ProcessPoolExecutor or share state between them

This is my code
def long_stage_task(node, deployment_folder_name, stage_s3_bucket):
    global workers
    logging.info("starting....")
    work = StageOS(node, deployment_folder_name, stage_s3_bucket)  # class
    work.stagestart()  # class method

executor = ProcessPoolExecutor(5)
executor.submit(long_stage_task, i, deployment_folder_name, stage_s3_bucket)
Now how can I stop a particular process/pid?
Is there any way to pass globals or shared state between them? I don't see anything in the docs.
https://docs.python.org/3/library/concurrent.futures.html
You could pass the workers a list of Events and set the relevant one when you want a particular worker to stop. This implies your long_stage_task function periodically checks its own Event.
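A minimal sketch of that approach, assuming the staging work can be broken into steps between checks; multiprocessing.Manager is used so the Event proxies can be pickled over to the pool processes:

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager
import time

def long_stage_task(stop_event, node):
    # Do the staging work in small steps, checking the event between steps.
    while not stop_event.is_set():
        time.sleep(1)  # placeholder for one unit of staging work
    print("worker for", node, "stopped")

if __name__ == "__main__":
    manager = Manager()  # Manager Events can be pickled to pool workers
    events = {node: manager.Event() for node in ["node-a", "node-b", "node-c"]}
    with ProcessPoolExecutor(5) as executor:
        for node, event in events.items():
            executor.submit(long_stage_task, event, node)
        time.sleep(3)
        events["node-b"].set()         # stop only the worker handling node-b
        for event in events.values():  # stop the rest so the pool can exit
            event.set()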
If what you are after is stopping a task which is taking too long, you can take a look at pebble. It allows setting timeouts on function calls as well as cancelling ongoing tasks.
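A sketch of the pebble route, based on pebble's documented ProcessPool.schedule API (double-check against the version you install). A task whose execution exceeds the timeout is terminated and its future raises TimeoutError:

from pebble import ProcessPool
from concurrent.futures import TimeoutError
import time

def long_stage_task(node):
    time.sleep(60)  # stand-in for a staging step that may take too long
    return node

if __name__ == "__main__":
    with ProcessPool(max_workers=5) as pool:
        future = pool.schedule(long_stage_task, args=["node-1"], timeout=5)
        try:
            print(future.result())
        except TimeoutError:
            print("task exceeded 5 seconds and was terminated")
        # future.cancel() can likewise stop a task that is still running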

Using alternative event loop without setting global policy

I'm using uvloop with websockets as
import uvloop
import websockets

coro = websockets.serve(handler, host, port)  # creates new server
loop = uvloop.new_event_loop()
loop.create_task(coro)
loop.run_forever()
It works fine, I'm just wondering whether I could run into some unexpected problems without setting the global asyncio policy to uvloop. As far as I understand, not setting the global policy should work as long as nothing down the stack uses the global asyncio functions but instead works with the passed-down event loop directly. Is that correct?
There are three main global objects in asyncio:
the policy (common to all threads)
the default loop (specific to the current thread)
the running loop (specific to the current thread)
All the attempts to get the current context in asyncio go through a single function, asyncio.get_event_loop.
One thing to remember is that since Python 3.6 (and Python 3.5.3+), get_event_loop has a specific behavior:
If it's called while a loop is running (e.g. within a coroutine), the running loop is returned.
Otherwise, the default loop is returned by the policy.
Example 1:
import asyncio
import uvloop

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
loop = asyncio.get_event_loop()
loop.run_forever()
Here the policy is the uvloop policy. The loop returned by get_event_loop is a uvloop, and it is set as the default loop for this thread. When this loop is running, it is registered as the running loop.
In this example, calling get_event_loop() anywhere in this thread returns the right loop.
Example 2:
import asyncio
import uvloop

loop = uvloop.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_forever()
Here the policy is still the default policy. The loop returned by new_event_loop is a uvloop, and it is set as the default loop for this thread explicitly using asyncio.set_event_loop. When this loop is running, it is registered as the running loop.
In this example, calling get_event_loop() anywhere in this thread returns the right loop.
Example 3:
import uvloop
loop = uvloop.new_event_loop()
loop.run_forever()
Here the policy is still the default policy. The loop returned by new_event_loop is a uvloop, but it is not set as the default loop for this thread. When this loop is running, it is registered as the running loop.
In this example, calling get_event_loop() within a coroutine returns the right loop (the running uvloop). But calling get_event_loop() outside a coroutine will result in a new standard asyncio loop, set as the default loop for this thread.
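A quick way to observe that pitfall (under the pre-3.10 get_event_loop semantics described above):

import asyncio
import uvloop

loop = uvloop.new_event_loop()
# No loop is running and no default loop was set for this thread, so the
# default policy creates and returns a brand-new standard asyncio loop:
other = asyncio.get_event_loop()
print(loop is other)  # False: 'other' is not the uvloop created above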
So the first two approaches are fine, but the third one is discouraged.
Custom event loop should be passed as param
If you want to use custom event loop without using asyncio.set_event_loop(loop), you'll have to pass loop as param to every relevant asyncio coroutines or objects, for example:
await asyncio.sleep(1, loop=loop)
or
fut = asyncio.Future(loop=loop)
You may notice that nearly every coroutine/object in the asyncio module accepts this param.
The same applies to the websockets library, as you can see from its source code. So you'll need to write:
loop = uvloop.new_event_loop()
coro = websockets.serve(handler, host, port, loop=loop) # pass loop as param
There's no guarantee that your program will work correctly if you don't pass your event loop as a param like that.
Possible, but uncomfortable
While you can theoretically use an event loop without changing the global policy, I find it extremely uncomfortable:
You'll have to write loop=loop almost everywhere, which is annoying.
There's no guarantee that some third-party library will allow you to pass loop as a param and won't just use asyncio.get_event_loop().
Based on that, I advise you to reconsider your decision and use a global event loop.
I understand that it may feel "wrong" to use a global event loop, but in practice the "right" way of passing the loop as a param everywhere is worse (in my opinion).

Celery worker variable sharing & initialization in bootstep

I have a question regarding process-shared variables in Python 3.4 and Celery 4.0.2. I already read a post (Celery worker variable sharing issues) where the goal of the poster was to not share the variable.
I currently have the exact opposite of this problem: I want to share a variable across all subprocesses of a single worker.
My situation is as follows:
I'm using prefork on my Celery worker (running under Ubuntu).
I created a bootstep where I retrieve data from a configuration server and store the dict in a global variable that is placed in its own module. The bootstep is a worker bootstep and its only dependency is the Timer. I do the (blocking) request in the bootstep's start method, and I can see that it receives the correct data.
Sometimes it "just works" and I see the global variable from my worker subprocesses; sometimes it does not: the variable is initialized, but the dict inside it where the data should be is empty.
I already did some debugging and found out that the ids of the objects that should hold my config dict are in fact the same (which suggests that the process was copied after the configuration variable was first accessed).
config.py in module config
class MyGlobalConfig(object):
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data[key]

global_config = MyGlobalConfig()
bootstep.py
import config
...
config.global_config.data = response_data
...
some.py accessed from a task
import config
...
config.global_config.get(key)  # fails because global_config.data is empty
...
I have no idea why it works sometimes and sometimes not. From what I've seen, Celery forks its subprocesses after the worker bootsteps have finished, so in theory my data should be there.
Apart from these startup issues, it is entirely possible that I get config updates during the lifetime of the worker, and I need to distribute those to all subprocesses as well.
Any ideas what's the best way to do this in Celery? Everything I have found so far is either worker-centric or utilizes the broker. Since this should only affect the local worker's processes, I don't want to use anything that could affect other non-local workers or that goes through the broker.
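One option, sketched below and not verified against this exact setup, is to (re-)initialize the config in every pool child through Celery's worker_process_init signal, which fires in each prefork child after the fork. fetch_config_from_server is a hypothetical stand-in for the blocking request the bootstep performs:

# hypothetical worker_init.py
from celery.signals import worker_process_init

import config

def fetch_config_from_server():
    # Stand-in: real code would perform the blocking request to the
    # configuration server, as the bootstep's start method does.
    return {}

@worker_process_init.connect
def init_child_config(**kwargs):
    # Runs in every forked pool process, so each child populates its own
    # copy regardless of when the parent was forked.
    config.global_config.data = fetch_config_from_server()

This covers the startup case; for config updates during the worker's lifetime the children would still need to re-fetch periodically or read from a shared structure such as a multiprocessing.Manager dict.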

How to make Python's socketIO-client library's wait(seconds=1) non-blocking?

I'm using socketIO-client 0.6.5 for a Python client that communicates with a Node server using socketio. My problem is that in order for a client listener to receive data from the server, I have to use the wait() method: wait() hangs the program indefinitely, while wait(seconds=n) hangs the program for that many seconds.
I'm using this for a game where the listener will be executed inside the game loop continuously, but if I use the wait() method, the game loop is stuck for that number of seconds, which I can't have. The code is far too large to post here, but here is a snippet representative of the actual code.
def main():
    sc = Sock_Con()
    while True:
        sc.push_player_location(2, 3)
        socketIO.on('get_player_location', sc.on_player_location)
        socketIO.wait(seconds=1)
If I don't use the wait() method, the data never gets picked up by the client. If I do use it, the program hangs for that number of seconds. Is there something I'm missing, or is there a workaround?
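One possible workaround, sketched under the assumption that socketIO-client's handlers are safe to run on a background thread in this game: move the blocking wait() onto a daemon thread so the game loop never blocks (Sock_Con and socketIO are the names from the snippet above):

import threading

def listen(socketIO, sc):
    # Register the handler once, then let wait() block this thread only.
    socketIO.on('get_player_location', sc.on_player_location)
    socketIO.wait()

def main():
    sc = Sock_Con()
    threading.Thread(target=listen, args=(socketIO, sc), daemon=True).start()
    while True:
        sc.push_player_location(2, 3)  # game loop continues unblocked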

Passing success and failure handlers to an ActiveJob

I have an ActiveJob that's supposed to load a piece of data from an external system over HTTP. When that job completes, I want to queue a second job that does some postprocessing and then submits the data to a different external system.
I don't want the first job to know about the second job, because
encapsulation
reusability
it's none of the first job's business, basically
Likewise, I don't want the first job to care what happens next if the data-loading fails -- maybe the user gets notified, maybe we retry after a timeout, maybe we just log it and throw up our hands -- again it could vary based on the details of the exception, and there's no need for the job to include the logic for that or the connections to other systems to handle it.
In Java (which is where I have the most experience), I could use something like Guava's ListenableFuture to add success and failure callbacks after the fact:
MyDataLoader loader = new MyDataLoader(someDataSource);
ListenableFuture<Data> future = executor.submit(loader);
Futures.addCallback(future, new FutureCallback<Data>() {
    public void onSuccess(Data result) {
        processData(result);
    }
    public void onFailure(Throwable t) {
        handleFailure(t);
    }
});
ActiveJob, though, doesn't seem to provide this sort of external callback mechanism -- as best I can make out from the relevant sections in "Active Job Basics", after_perform and rescue_from are only meant to be called from within the job class. And after_perform isn't meant to distinguish between success and failure.
So the best I've been able to come up with (and I'm not claiming it's very good) is to pass a couple of lambdas into the job's perform method, thus:
class MyRecordLoader < ActiveJob::Base
  # Loads data expensively (hopefully on a background queue) and passes
  # the result, or any exception, to the appropriate specified lambda.
  #
  # @param data_source [String] the URL to load data from
  # @param on_success [-> (String)] A lambda that will be passed the record
  #   data, if it's loaded successfully
  # @param on_failure [-> (Exception)] A lambda that will be passed any
  #   exception, if there is one
  def perform(data_source, on_success, on_failure)
    begin
      result = load_data_expensively_from data_source
      on_success.call(result)
    rescue => exception
      on_failure.call(exception)
    end
  end
end
(Side note: I have no idea what the yardoc syntax is for declaring lambdas as parameters. Does this look correct, or, failing that, plausible?)
The caller would then have to pass these in:
MyRecordLoader.perform_later(
  some_data_source,
  method(:process_data),
  method(:handle_failure)
)
That's not terrible, at least on the calling side, but it seems clunky, and I can't help but suspect there's a common pattern for this that I'm just not finding. And I'm somewhat concerned that, as a Ruby/Rails novice, I'm just bending ActiveJob to do something it was never meant to do in the first place. All the ActiveJob examples I'm finding are 'fire and forget' -- asynchronously "returning" a result doesn't seem to be an ActiveJob use case.
Also, it's not clear to me that this will work at all in the case of a back-end like Resque that runs the jobs in a separate process.
What's "the Ruby way" to do this?
Update: As hinted at by dre-hh, ActiveJob turned out not to be the right tool here. It was also unreliable, and overcomplicated for the situation. I switched to Concurrent Ruby instead, which fits the use case better, and which, since the tasks are mostly IO-bound, is fast enough even on MRI, despite the GIL.
ActiveJob is not an async library like a future or promise.
It is just an interface for performing tasks in the background. The current thread/process receives no result of the operation.
For example, when using Sidekiq as the ActiveJob queue backend, it will serialize the parameters of the perform method into the redis store. Another daemon process running within the context of your rails app will watch the redis queue and instantiate your worker with the deserialized data.
So passing callbacks might be alright, but why have them as methods on another class? Passing callbacks would make sense if they were dynamic (changing on each invocation). Since you have them implemented on the calling class, consider just moving those methods into your job worker class.
