JupyterHub custom spawner: long delays on start - python-asyncio

My custom spawner connects via SSH to a Slurm submit node on the user's behalf and submits a Slurm job.
All of that takes a long time, around 10 seconds if the job can start straight away, which is expected, but I want the user to be redirected to a progress page immediately.
Instead there is a 10-second hang between the user pressing the "start" button and the progress page appearing. It looks like JupyterHub waits for the start method to complete before redirecting.
The start method does the following:
awaits the asyncssh connection
awaits the Slurm job submission
awaits the job status becoming "Running"
So there seem to be plenty of opportunities for JupyterHub to do other things while the start method is running.
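For reference, a minimal sketch of the kind of start() described above, using asyncssh and plain sbatch/squeue calls; the host name, job script path, and squeue parsing are illustrative assumptions, not the actual spawner code. Every step awaits, so the event loop itself is never blocked:

import asyncio
import asyncssh

class SlurmSpawnerSketch:
    submit_host = "slurm-login.example.org"  # assumption: your submit node

    async def start(self):
        # open the SSH connection on the user's behalf (yields to the event loop)
        self.conn = await asyncssh.connect(self.submit_host)

        # submit the batch job and remember its id
        submit = await self.conn.run("sbatch --parsable /path/to/job.sh")
        self.job_id = submit.stdout.strip()

        # poll until Slurm reports the job as RUNNING
        while True:
            status = await self.conn.run(f"squeue -h -j {self.job_id} -o %T")
            if status.stdout.strip() == "RUNNING":
                break
            await asyncio.sleep(2)

        return ("compute-node.example.org", 8888)  # placeholder (ip, port)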

It looks like the issue was related to my spawner using an options_form. An options form causes the spawn to be triggered by a POST request, and in JupyterHub 1.1 a POST spawn doesn't redirect to the pending page.
This behavior is fixed in the current master branch:
https://github.com/jupyterhub/jupyterhub/commit/3908c6d041987e69db7150dcf2041916053b863d
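For context, "using an options_form" means something like the following on the custom spawner; submitting this form is what triggers the POST spawn code path. The form contents here are only an illustration:

from jupyterhub.spawner import Spawner

class MySlurmSpawner(Spawner):
    # a simple HTML form shown on the spawn page
    options_form = """
        <label for="partition">Slurm partition</label>
        <input name="partition" value="general">
    """

    def options_from_form(self, formdata):
        # form values arrive as lists of strings
        return {"partition": formdata.get("partition", ["general"])[0]}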

Related

Handling multiple REST services in an Activiti process

I am completely new to Spring and Activiti and made myself a little project which works just fine: there are 4 service tasks, a REST controller, 1 process, 1 service, and 4 methods in that service.
When I call the server endpoint, I start my process and it just goes step by step through my service tasks and calls service.method as defined in the expression ${service.myMethod()}.
But what I really need is a workflow that stops after a service call and waits until another request is sent, similar to a user task waiting for input; the whole process should pause until I send a request to another endpoint,
like myurl:8080/startprocess, and maybe the next day myurl:8080/continueprocess. Maybe even save some data for continued use.
Is there a simple, predefined way to do this?
Best regards
You can use human tasks for that, or use a "signal intermediate catching event" (see Activiti's user guide) after each activity.
When you do that, the first REST call will start a new process instance that executes your flow's activities until it reaches the signal element. When this happens, the engine saves its current state and returns control to the caller.
To make your flow progress, you have to send it a "signal", which you can do with an API call or via the REST API (see item 15.6.2 in the guide).
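A hedged sketch of what the "continueprocess" endpoint could do, throwing the signal through Activiti's REST API with the requests library; the base URL, credentials, and signal name are assumptions, so check the REST chapter of your Activiti version's guide for the exact resource path:

import requests

ACTIVITI_REST = "http://localhost:8080/activiti-rest/service"  # assumption

def continue_process(signal_name="continueSignal"):
    # assumption: the runtime/signals resource throws a signal to waiting executions
    resp = requests.post(
        f"{ACTIVITI_REST}/runtime/signals",
        json={"signalName": signal_name},
        auth=("kermit", "kermit"),  # demo credentials, replace in real use
    )
    resp.raise_for_status()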

How long can a Worker Role process set status to "busy" before getting killed?

I have a worker role process that wants to stop processing new requests when it's too busy (e.g. CPU load > 80%, a long disk queue, or some other metric).
If I set the role status to "busy", will it get killed by the Fabric Controller after staying busy for too long? If so, how long does it take until the Fabric Controller kills the process?
I assume the process is still able to send/receive signals to/from the fabric agent.
Thanks!
You can leave an instance in the Busy status forever. The only time Azure will take recovery action is if the process exits. See http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx for some additional information.
Also, what is your worker role doing? Setting the instance status to Busy will only take it out of the load balancer rotation so that new incoming TCP connections will not get routed to that instance. But if your worker role is a typical worker role that does background jobs (i.e. sits in a loop picking messages up from a queue, or listening on an InternalEndpoint for requests coming from a front-end web role) then setting it to Busy will have no effect. In this scenario you would add logic to your code to stop doing work, but what that looks like will depend on what type of work your role is doing.
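The role-status APIs themselves are .NET, so this is only a language-agnostic sketch (in Python, with psutil and a hypothetical queue client) of the "stop doing work while busy" logic described above, separate from whatever status you report to the fabric:

import time
import psutil

def worker_loop(queue):
    while True:
        if psutil.cpu_percent(interval=1) > 80:
            # too busy: stop pulling new work, wait and re-check
            time.sleep(5)
            continue
        message = queue.get_message()   # hypothetical queue client
        if message is not None:
            handle(message)             # hypothetical message handler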

Delayed_job going into a busy loop when lodging second task

I am running delayed_job for a few background services, all of which, until recently, ran in isolation, e.g. send an email, write a report, etc.
I now have a need for one delayed_job, as its last step, to lodge another delayed_job.
delay.deploy() - when delayed_job runs this, it triggers a deploy action, the last step of which is to ...
delay.update_status() - when delayed_job runs this job, it will check the status of the deploy we started. If the deploy is still progressing, we call delay.update_status() again, if the deploy has stopped we write the final deploy status to a db record.
Step 1 works fine - after 5 seconds, delayed_job fires up the deploy, which starts the deployment, and then calls delay.update_status().
But here, instead of update_status() starting up in 5 seconds, delayed_job goes into a busy loop, firing off a bunch of update_status calls and looping hard without pause.
I can see the logs filling up with all these calls, the server slows down, until the end-condition for update_status is reached (deploy has eventually succeeded or failed), and things get quiet again.
Am I using Delayed_Job::delay() incorrectly, or am I missing a basic tenet of this use case?
OK, it turns out this is "expected behaviour": if you are already inside the code running for a delayed_job and you call .delay() again without specifying a delay, it will run immediately. You need to add the run_at parameter:
delay(queue: :deploy, run_at: 10.seconds.from_now).check_status
See the discussion in Google Groups.

How to guarantee a long operation completes

Normally, billings should execute in the background on a scheduled date (I haven't figured out how to do that yet, but that's another topic).
But occasionally, the user may wish to execute a billing manually. Once clicked, I would like to be sure the operation runs to completion regardless of what happens on the user side (e.g. closes browser, machine dies, network goes down, whatever).
I'm pretty sure db.SaveChanges() wraps its DB operations in a transaction, so from a server perspective I believe the whole thing will either finish or fail, with no partial effect.
But what about all the work between the POST and the db.SaveChanges()? Is there a way to be sure the user can't inadvertently or intentionally stop that from completing?
I guess a corollary to this question is what happens to a running Asynchronous Controller or a running Task or Thread if the user disconnects?
My previous project was actually a billing system in MVC. I distinctly remember testing what would happen if I used a Task and then quickly exited the site. It did all of the calculations just fine, ran a stored procedure in SQL Server, and sent me an e-mail when it was done.
So, to answer your question: if you wrap the operations in a Task, they should finish anyway with no problems.

async execution of tasks for a web application

A web application I am developing needs to perform tasks that are too long to be executed during the http request/response cycle. Typically, the user will perform the request, the server will take this request and, among other things, run some scripts to generate data (for example, render images with povray).
Of course, these tasks can take a long time, so the server should not hang waiting for the scripts to complete before sending the response to the client. I therefore need to execute the scripts asynchronously, give the client a "the resource is here, but not ready" answer, and probably tell it an ajax endpoint to poll, so it can retrieve and display the resource when ready.
Now, my question is not about the design (although I would very much enjoy any hints on this regard as well). My question is: does a system to solve this issue already exist, so I do not reinvent the square wheel? If I had to, I would use a process queue manager to submit the task and put up an HTTP endpoint to report the status, something like "pending", "aborted", "completed", to the ajax client, but if something similar already exists specifically for this task, I would gladly use it.
I am working in python+django.
Edit: Please note that the main issue here is not how the server and the client must negotiate and exchange information about the status of the task.
The issue is how the server handles the submission and enqueueing of very long tasks. In other words, I need a better system than having my server submit scripts to LSF. Not that it would not work, but I think it's a bit too much...
Edit 2: I added a bounty to see if I can get some other answer. I checked pyprocessing, but I cannot submit a job and reconnect to the queue at a later stage.
You should avoid re-inventing the wheel here.
Check out gearman. It has libraries in a lot of languages (including Python) and is fairly popular. Not sure if anyone has any out-of-the-box ways to easily connect up Django to gearman and ajax calls, but it shouldn't be too complicated to do that part yourself.
The basic idea is that you run the gearman job server (or multiple job servers), have your web request queue up a job (like 'resize_photo') with some arguments (like '{photo_id: 1234}'). You queue this as a background task. You get a handle back. Your ajax request is then going to poll on that handle value until it's marked as complete.
Then you have a worker (or probably many), a separate Python process, that connects to this job server, registers itself for 'resize_photo' jobs, does the work and then marks the job as complete.
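A minimal sketch of that flow with the python-gearman client/worker API; the 'resize_photo' task name and the localhost job server follow the example above, the rest is placeholder:

import json
import gearman

# web request side: queue a background job and keep the handle for polling
client = gearman.GearmanClient(["localhost:4730"])
request = client.submit_job(
    "resize_photo", json.dumps({"photo_id": 1234}),
    background=True, wait_until_complete=False)
job_handle = request.job.handle  # store this; the ajax view polls its status

# worker side: a separate process that registers for the task and does the work
def resize_photo(worker, job):
    args = json.loads(job.data)
    # ... actually resize photo args["photo_id"] here ...
    return "done"

worker = gearman.GearmanWorker(["localhost:4730"])
worker.register_task("resize_photo", resize_photo)
worker.work()  # blocks, processing jobs as they arrive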
I also found this blog post that does a pretty good job of summarizing its usage.
You can try two approaches:
Call the web server every n seconds with a job id; the server processes the request and returns some information about the current execution of that task.
Implement a long-running page that sends data every n seconds; for the client, that HTTP request will "always" be "loading", and it needs to collect new information every time a new piece of data is received.
About the second option, you can learn more by reading about Comet; using ASP.NET, you can do something similar by implementing the System.Web.IHttpAsyncHandler interface.
I don't know of a system that does it, but it would be fairly easy to implement one's own system:
create a database table with jobid, jobparameters, jobresult
jobresult is a string that will hold a pickle of the result
jobparameters is a pickled list of input arguments
when the server starts working on a job, it creates a new row in the table and spawns a new process to handle it, passing that process the jobid
the task handler process updates the jobresult in the table when it has finished
a webpage (xmlrpc or whatever you are using) contains a method 'getResult(jobid)' that will check the table for a jobresult
if it finds a result, it returns the result, and deletes the row from the table
otherwise it returns an empty list, or None, or your preferred return value to signal that the job is not finished yet
There are a few edge-cases to take care of so an existing framework would clearly be better as you say.
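A condensed sketch of that table-based scheme, using sqlite3, pickle and multiprocessing; the table and column names follow the description above and the rest is placeholder:

import pickle
import sqlite3
from multiprocessing import Process

DB = "jobs.db"

def init():
    with sqlite3.connect(DB) as db:
        db.execute("CREATE TABLE IF NOT EXISTS jobs "
                   "(jobid INTEGER PRIMARY KEY, jobparameters BLOB, jobresult BLOB)")

def submit(func, *args):
    # record the job, then spawn a separate process to handle it
    with sqlite3.connect(DB) as db:
        cur = db.execute("INSERT INTO jobs (jobparameters) VALUES (?)",
                         (pickle.dumps(args),))
        jobid = cur.lastrowid
    Process(target=_run, args=(jobid, func, args)).start()
    return jobid

def _run(jobid, func, args):
    # the task handler process updates jobresult when it has finished
    result = func(*args)
    with sqlite3.connect(DB) as db:
        db.execute("UPDATE jobs SET jobresult = ? WHERE jobid = ?",
                   (pickle.dumps(result), jobid))

def get_result(jobid):
    # returns the unpickled result and deletes the row, or None if not done yet
    with sqlite3.connect(DB) as db:
        row = db.execute("SELECT jobresult FROM jobs WHERE jobid = ?",
                         (jobid,)).fetchone()
        if row and row[0] is not None:
            db.execute("DELETE FROM jobs WHERE jobid = ?", (jobid,))
            return pickle.loads(row[0])
    return None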
First, you need a separate "worker" service, which is started on its own at power-up and communicates with the HTTP request handlers via some local IPC such as a UNIX socket (fast) or a database (simple).
While handling a request, the CGI asks the worker for its state or other data and relays the reply to the client.
You can signal that a resource is being "worked on" by replying with a 202 HTTP code: the Client side will have to retry later to get the completed resource. Depending on the case, you might have to issue a "request id" in order to match a request with a response.
Alternatively, you could have a look at existing COMET libraries which might fill your needs more "out of the box". I am not sure if there are any that match your current Django design though.
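A minimal Django-flavoured sketch of the 202 idea above: accept the request, hand the work off to whatever queue ends up doing it, and reply with a request id for the client to poll with. enqueue_render() and get_status() stand in for that queue layer and are assumptions:

import uuid
from django.http import JsonResponse

def start_render(request):
    request_id = str(uuid.uuid4())
    enqueue_render(request_id, request.POST)   # hypothetical: push onto your task queue
    return JsonResponse({"request_id": request_id}, status=202)

def render_status(request, request_id):
    status = get_status(request_id)            # hypothetical: "pending" / "aborted" / "completed"
    return JsonResponse({"request_id": request_id, "status": status})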
Probably not a great answer for the Python/Django solution you are working with, but we use Microsoft Message Queue (MSMQ) for things just like this. It basically runs like this:
Website updates a database row somewhere with a "Processing" status
Website sends a message to the MSMQ (this is a non blocking call so it returns control back to the website right away)
Windows service (could be any program really) is "watching" the MSMQ and gets the message
Windows service updates the database row with a "Finished" status.
That's the gist of it anyway. It's been quite reliable for us and really straightforward to scale and manage.
-al
Another good option for Python and Django is Celery.
And if you think Celery is too heavy for your needs, then you might want to look at simple distributed taskqueue.
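A bare-bones Celery sketch for the povray-style use case above; the broker URL, backend, and task body are placeholders:

from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def render_scene(scene_id):
    # run povray or any other long script here
    return {"scene_id": scene_id, "status": "completed"}

# from the Django view:
#   result = render_scene.delay(scene_id)       # returns immediately with an AsyncResult
#   result.id                                   # hand this id to the ajax poller
#   render_scene.AsyncResult(result.id).state   # PENDING / STARTED / SUCCESS / FAILURE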

Resources