Stop a specific process in a Python ProcessPoolExecutor, or share state between them - multiprocessing

This is my code
from concurrent.futures import ProcessPoolExecutor
import logging

def long_stage_task(node, deployment_folder_name, stage_s3_bucket):
    global workers
    logging.info("starting....")
    work = StageOS(node, deployment_folder_name, stage_s3_bucket)  # class
    work.stagestart()  # class method

executor = ProcessPoolExecutor(5)
executor.submit(long_stage_task, i, deployment_folder_name, stage_s3_bucket)
Now, how can I stop a particular process/PID?
Is there any way to pass globals or shared state between them? I don't see anything in the docs:
https://docs.python.org/3/library/concurrent.futures.html

You could pass the workers a list of Events and set them when you want a worker to stop. This implies your long_stage_task function periodically checks its own Event.
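A minimal sketch of that approach, assuming the per-node work can be done in chunks so the Event can be checked between chunks; the stop_events dict, the node numbering, and the sleep are illustrative, not from the original code. Manager().Event() proxies are used because they can be pickled and passed through executor.submit; the same Manager can also hold shared dicts or lists, which covers the shared-state part of the question.

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager
import time

def long_stage_task(node, stop_event):
    while not stop_event.is_set():   # periodically check our own event
        # ... do one chunk of staging work for this node ...
        time.sleep(1)

if __name__ == "__main__":
    manager = Manager()
    stop_events = {node: manager.Event() for node in range(5)}
    with ProcessPoolExecutor(5) as executor:
        for node, event in stop_events.items():
            executor.submit(long_stage_task, node, event)
        time.sleep(3)
        stop_events[2].set()              # stop only the task for node 2
        for event in stop_events.values():
            event.set()                   # shut the rest down so the example exits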
If what you are after is stopping a task which is taking too long, you can take a look at pebble. It allows you to set timeouts on function calls as well as to cancel ongoing tasks.
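A hedged sketch of the pebble approach (the pool size, the timeout value, and the task body are illustrative; pebble's schedule() returns a future whose result() raises TimeoutError when the timeout is exceeded, and the worker process is terminated):

from pebble import ProcessPool
from concurrent.futures import TimeoutError
import time

def long_stage_task(node):
    time.sleep(5)  # stand-in for the real staging work
    return node

if __name__ == "__main__":
    with ProcessPool(max_workers=5) as pool:
        future = pool.schedule(long_stage_task, args=(1,), timeout=2)
        try:
            print(future.result())
        except TimeoutError:
            print("task exceeded the timeout and its worker process was terminated")
        # an ongoing task can also be cancelled explicitly:
        # future.cancel()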

Related

Using wait_for with timeouts with a list of tasks

So, I have a list of tasks which I want to schedule concurrently in a non-blocking fashion.
Basically, gather should do the trick.
Like
tasks = [asyncio.create_task(some_task()) for _ in bleh]
results = await asyncio.gather(*tasks)
But then, I also need a timeout. What I want is that any task which takes > timeout time cancels and I proceed with what I have.
I found the asyncio.wait primitive:
https://docs.python.org/3/library/asyncio-task.html#waiting-primitives
But then the doc says:
Run awaitable objects in the aws set concurrently and block until the condition specified by return_when.
Which seems to suggest that it blocks...
It seems that asyncio.wait_for will do the trick
https://docs.python.org/3/library/asyncio-task.html#timeouts
But how do I send in a list of awaitables rather than just a single awaitable?
What I want is that any task which takes > timeout time cancels and I proceed with what I have.
This is straightforward to achieve with asyncio.wait():
# Wait for tasks to finish, but no more than a second.
done, pending = await asyncio.wait(tasks, timeout=1)

# Cancel the ones not done by now.
for fut in pending:
    fut.cancel()

# Results are available as x.result() on futures in `done`.
Which seems to suggest that [asyncio.wait] blocks...
It only blocks the current coroutine, the same as gather or wait_for.
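Put together, a minimal runnable sketch of that pattern (some_task, the task count, and the 1-second timeout are illustrative):

import asyncio
import random

async def some_task(i):
    await asyncio.sleep(random.uniform(0.5, 2.0))
    return i

async def main():
    tasks = [asyncio.create_task(some_task(i)) for i in range(10)]
    done, pending = await asyncio.wait(tasks, timeout=1)
    for fut in pending:              # cancel anything still running
        fut.cancel()
    results = [fut.result() for fut in done]
    print(f"{len(results)} tasks finished within the timeout")

asyncio.run(main())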

Celery worker variable sharing & initialization in bootstep

I have a question regarding process-shared variables in Python 3.4 and Celery 4.0.2. I already read a post (Celery worker variable sharing issues) where the goal of the poster was to not share the variable.
I currently have the exact opposite problem: I want to share a variable across all subprocesses of a single worker.
My situation is as follows:
I'm using prefork on my Celery worker (running under Ubuntu).
I created a bootstep where I retrieve data from a configuration server and store the dict in a global variable that's placed in its own module. The bootstep is a worker bootstep and its only dependency is the Timer. I do the (blocking) request in the bootstep's start method, and I can see that it receives the correct data.
Sometimes it "just works" and I see the global variable from my worker subprocesses; sometimes it does not work: the variable is initialized, but the dict inside it that should hold the data is empty.
I already did some debugging and found out that the ids of the objects that should hold my config dict are in fact the same (which suggests that the process was copied after the configuration variable was first accessed).
config.py in module config:

class MyGlobalConfig(object):
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

global_config = MyGlobalConfig()

bootstep.py:

import config
...
config.global_config.data = response_data
...

some.py, accessed from a task:

import config
...
config.global_config.get(key)  # returns None because global_config.data is empty
...
I have no idea why it works sometimes and sometimes not. From what I've seen, Celery forks its subprocesses after the worker bootsteps have finished, so in theory my data should be there.
Apart from these startup issues, it is entirely possible that I get config updates during the lifetime of the worker, and I need to distribute these to all subprocesses as well.
Any ideas what the best way to do this in Celery is? Everything I have found so far is either worker-centric or uses the broker. Since this should only apply to the local worker's processes, I don't want to use anything that could affect other, non-local workers or that goes through the broker...
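For illustration only (this is not an answer from the thread), one way to sidestep the fork-timing issue is to (re)load the configuration in every child process via Celery's worker_process_init signal, so each subprocess fills config.global_config itself instead of relying on what was present at fork time. fetch_config() below is a hypothetical stand-in for the blocking request to the configuration server, and config is the module from the question:

from celery.signals import worker_process_init

import config

def fetch_config():
    # hypothetical stand-in for the blocking request to the config server
    return {}

@worker_process_init.connect
def load_config_in_child(**kwargs):
    # runs in each forked subprocess, after the fork
    config.global_config.data = fetch_config()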

Passing success and failure handlers to an ActiveJob

I have an ActiveJob that's supposed to load a piece of data from an external system over HTTP. When that job completes, I want to queue a second job that does some postprocessing and then submits the data to a different external system.
I don't want the first job to know about the second job, because
encapsulation
reusability
it's none of the first job's business, basically
Likewise, I don't want the first job to care what happens next if the data-loading fails -- maybe the user gets notified, maybe we retry after a timeout, maybe we just log it and throw up our hands -- again it could vary based on the details of the exception, and there's no need for the job to include the logic for that or the connections to other systems to handle it.
In Java (which is where I have the most experience), I could use something like Guava's ListenableFuture to add success and failure callbacks after the fact:
MyDataLoader loader = new MyDataLoader(someDataSource);
ListenableFuture<Data> future = executor.submit(loader);

Futures.addCallback(future, new FutureCallback<Data>() {
    public void onSuccess(Data result) {
        processData(result);
    }

    public void onFailure(Throwable t) {
        handleFailure(t);
    }
});
ActiveJob, though, doesn't seem to provide this sort of external callback mechanism -- as best I can make out from the relevant sections in "Active Job Basics", after_perform and rescue_from are only meant to be called from within the job class. And after_perform isn't meant to distinguish between success and failure.
So the best I've been able to come up with (and I'm not claiming it's very good) is to pass a couple of lambdas into the job's perform method, thus:
class MyRecordLoader < ActiveJob::Base
  # Loads data expensively (hopefully on a background queue) and passes
  # the result, or any exception, to the appropriate specified lambda.
  #
  # @param data_source [String] the URL to load data from
  # @param on_success [-> (String)] A lambda that will be passed the record
  #   data, if it's loaded successfully
  # @param on_failure [-> (Exception)] A lambda that will be passed any
  #   exception, if there is one
  def perform(data_source, on_success, on_failure)
    begin
      result = load_data_expensively_from data_source
      on_success.call(result)
    rescue => exception
      on_failure.call(exception)
    end
  end
end
(Side note: I have no idea what the yardoc syntax is for declaring lambdas as parameters. Does this look correct, or, failing that, plausible?)
The caller would then have to pass these in:
MyRecordLoader.perform_later(
  some_data_source,
  method(:process_data),
  method(:handle_failure)
)
That's not terrible, at least on the calling side, but it seems clunky, and I can't help but suspect there's a common pattern for this that I'm just not finding. And I'm somewhat concerned that, as a Ruby/Rails novice, I'm just bending ActiveJob to do something it was never meant to do in the first place. All the ActiveJob examples I'm finding are 'fire and forget' -- asynchronously "returning" a result doesn't seem to be an ActiveJob use case.
Also, it's not clear to me that this will work at all in the case of a back-end like Resque that runs the jobs in a separate process.
What's "the Ruby way" to do this?
Update: As hinted at by dre-hh, ActiveJob turned out not to be the right tool here. It was also unreliable, and overcomplicated for the situation. I switched to Concurrent Ruby instead, which fits the use case better, and which, since the tasks are mostly IO-bound, is fast enough even on MRI, despite the GIL.
ActiveJob is not an async library like a future or promise.
It is just an interface for performing tasks in the background. The current thread/process receives no result of this operation.
For example, when using Sidekiq as the ActiveJob queue, it will serialize the parameters of the perform method into the Redis store. Another daemon process running within the context of your Rails app will be watching the Redis queue and will instantiate your worker with the serialized data.
So passing callbacks might be alright, but why have them as methods on another class? Passing callbacks would make sense if they were dynamic (changing on each invocation). Since you have them implemented on the calling class, consider just moving those methods into your job worker class.

Spring @Async cancel and start?

I have a Spring MVC app where a user can kick off report generation via a button click. This process could take a few minutes (~10-20 mins).
I use Spring's @Async annotation around the service call so that report generation happens asynchronously, while I show the user a message indicating a job is currently running.
Now what I want is that another user (an admin) can kick off report generation via the button, which should cancel/stop the currently running @Async task and start a new one.
To do this, I call the following:
...
future = getCurrentTask(id); // returns the current task for the given report id
if (!future.isDone())
    future.cancel(true);
service.generateReport(id);
How can I make it so that service.generateReport waits until the future's cancel has killed all the running threads?
According to the documentation, after I call future.cancel(true), isDone will return true, and isCancelled will also return true. So there is no way of knowing whether the job has actually been cancelled.
I can only start a new report generation when the old one is cancelled or completed, so that it does not dirty the data.
From the documentation of the cancel() method:
Subsequent calls to isCancelled() will always return true if this method returned true
Try this.
future = getCurrentTask(id); // returns the current task for the given report id
if (!future.isDone()) {
    boolean terminatedImmediately = future.cancel(true);
    if (terminatedImmediately) {
        service.generateReport(id);
    } else {
        // Inform the user the existing job couldn't be stopped and to try again later.
    }
}
Assuming the code above runs in thread A, and your recently cancelled report is running in thread B, you need thread A to stop before service.generateReport(id) and wait until thread B is complete / cancelled.
One approach to achieve this is to use a Semaphore. Assuming there can be only one report running concurrently, first create a semaphore object accessible by all threads (normally on the report runner service class):
Semaphore semaphore = new Semaphore(1);
At any point in your code where you need to run the report, call the acquire() method. This method will block until a permit is available. Similarly, when the report execution is finished / cancelled, make sure release() is called. The release method puts the permit back and wakes up another waiting thread.
semaphore.acquire();
try {
    // run report..
} finally {
    semaphore.release();  // always return the permit, even if the report fails
}

Running Plone subscriber events asynchronously

Using Plone 4, I have successfully created a subscriber event to do extra processing when a custom content type is saved. I accomplished this by using the Products.Archetypes.interfaces.IObjectInitializedEvent interface.
configure.zcml
<subscriber
    for="mycustom.product.interfaces.IRepositoryItem
         Products.Archetypes.interfaces.IObjectInitializedEvent"
    handler=".subscribers.notifyCreatedRepositoryItem"
    />
subscribers.py
def notifyCreatedRepositoryItem(repositoryitem, event):
    """
    This gets called on IObjectInitializedEvent - which occurs when a new object is created.
    """
    # my custom processing goes here. Should be asynchronous.
However, the extra processing can sometimes take too long, and I was wondering if there is a way to run it in the background, i.e. asynchronously.
Is it possible to run subscriber events asynchronously for example when one is saving an object?
Not out of the box. You'd need to add async support to your environment.
Take a look at plone.app.async; you'll need a ZEO environment and at least one extra instance. The latter will run the async jobs you push into the queue from your site.
You can then define methods to be executed asynchronously and push tasks into the queue to run them.
Example code to push a task into the queue:

from zope.component import getUtility
from plone.app.async.interfaces import IAsyncService

async = getUtility(IAsyncService)
async.queueJob(an_async_task, someobject, arg1_value, arg2_value)

and the task itself:

def an_async_task(someobject, arg1, arg2):
    # do something with someobject
    ...

where someobject is a persistent object in your ZODB. IAsyncService.queueJob takes at least a function and a context object, but you can add as many further arguments as you need to execute your task. The arguments must be pickleable.
The task will then be executed by an async worker instance when it can, outside of the context of the current request.
Just to give more options, you could try collective.taskqueue for that; it's really simple and really powerful (and avoids some of the drawbacks of plone.app.async).
The description on PyPI already has enough to get you up to speed in no time, and you can use Redis for queue management, which is a big plus.
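As a hedged sketch of how that could look in the subscriber from the question (collective.taskqueue queues tasks as paths to views, which a consumer then requests asynchronously; the @@do-extra-processing view name is made up for illustration):

from collective.taskqueue import taskqueue

def notifyCreatedRepositoryItem(repositoryitem, event):
    # Queue an asynchronous request to a browser view that does the slow work,
    # instead of doing it inline in the subscriber.
    path = '/'.join(repositoryitem.getPhysicalPath())
    taskqueue.add(path + '/@@do-extra-processing')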
