Delayed_job going into a busy loop when lodging a second task - ruby

I am running delayed_job for a few background services, all of which, until recently, ran in isolation, e.g. sending an email, writing a report, etc.
I now have a need for one delayed_job, as its last step, to lodge another delayed_job.
delay.deploy() - when delayed_job runs this, it triggers a deploy action, the last step of which is to ...
delay.update_status() - when delayed_job runs this job, it checks the status of the deploy we started. If the deploy is still progressing, we call delay.update_status() again; if the deploy has stopped, we write the final deploy status to a db record.
Step 1 works fine - after 5 seconds, delayed_job fires up the deploy, which starts the deployment, and then calls delay.update_status().
But here, instead of update_status() starting after 5 seconds, delayed_job goes into a busy loop, firing off a stream of update_status calls and looping hard without pause.
I can see the logs filling up with all these calls and the server slowing down, until the end condition for update_status is reached (the deploy has eventually succeeded or failed) and things go quiet again.
Am I using Delayed_Job::delay() incorrectly, or am I missing a basic tenet of this use case?

OK, it turns out this is "expected behaviour": if you are already inside the code running for a delayed_job and you call .delay() again without specifying a delay, it will run immediately. You need to add the run_at parameter:
delay(queue: :deploy, run_at: 10.seconds.from_now).check_status
See the discussion on Google Groups.
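For context, here is a minimal sketch of the pattern that ends up working; Deployer, start_deployment, deployment_running? and record_final_status are illustrative names, not from the original code:

    class Deployer
      # Kick off the deployment, then schedule the first status check.
      def deploy
        start_deployment  # assumed helper that triggers the actual deploy
        delay(queue: :deploy, run_at: 10.seconds.from_now).check_status
      end

      # Poll the deploy and re-enqueue this check until it finishes.
      def check_status
        if deployment_running?  # assumed helper that polls the deploy
          # Without run_at, a job enqueued from inside a running job
          # fires immediately, which is what caused the busy loop.
          delay(queue: :deploy, run_at: 10.seconds.from_now).check_status
        else
          record_final_status  # assumed helper that writes the db record
        end
      end
    end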

Related

Disable marking stuck jobs as failed for specific GitLab runners

We have one GitLab runner that is intended for benchmarking purposes.
A job can take from a few minutes to possibly a few days.
This all works fine, until there are two jobs and one takes too long to complete.
The waiting job, after some time, complains that it is stuck.
Afterwards it is marked as failed, never to be executed at all.
This is very annoying. For our usual pipeline it makes sense, because either the runner is dead or the job's .gitlab-ci.yml is not set up properly.
However, here the waiting job just has to wait longer.
Can we disable this stuck->failed feature for this specific runner?
(The timeout of the job is set up correctly, so it is able to run that long, as explained here.)
This is currently an open issue (https://gitlab.com/gitlab-org/gitlab/-/issues/19294).
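For reference, the per-job timeout mentioned in the question is set in .gitlab-ci.yml; a hedged sketch, where the job name, tag and script are placeholders:

    benchmark:
      stage: test
      tags:
        - benchmark-runner   # placeholder tag for the dedicated runner
      timeout: 3d            # per-job timeout; must stay within the runner's own limit
      script:
        - ./run_benchmarks.sh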

Run a script the whole time on a VPS server

Is it possible to create a script that is always running on my VPS server? And what do I need to do to keep it running the whole time? (I don't have a VPS server yet, but if this is possible I want to buy one!)
Yes you can; there are several ways to get the result you want.
Supervisord
Supervisord is a process control system that keeps any process running. It automatically starts or restarts your process whenever necessary.
When to use it: use it when you need a process that runs continuously, e.g.:
A queue worker that reads a database continuously waiting for a job to run.
A node application that acts like a daemon
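A minimal supervisord program entry might look like this; the program name and paths are placeholders:

    ; /etc/supervisor/conf.d/myworker.conf
    [program:myworker]
    ; the long-running process to keep alive
    command=/usr/bin/ruby /home/deploy/app/worker.rb
    ; start with supervisord, and restart the process whenever it exits
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/myworker.out.log
    stderr_logfile=/var/log/myworker.err.log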
Cron
Cron allows you to run processes regularly, at fixed time intervals. You can, for example, run a process every minute, every 30 minutes, or at any interval you need.
When to use it: use it when your process is not long-running (it does a task and ends) and you do not need it restarted automatically as with Supervisord, e.g.:
A task that collects logs every day and sends them as a gzip by email
A backup routine.
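A couple of illustrative crontab entries (edit with crontab -e; the scripts are placeholders):

    # collect the day's logs and mail them, every day at 23:55
    55 23 * * * /home/deploy/bin/mail_logs.sh
    # run a backup routine every day at 02:30
    30 2 * * * /home/deploy/bin/backup.sh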
Whichever you choose, there are many tutorials on the internet on how to configure both, so I won't go into the details here.

Sidekiq - view completed jobs

Is it possible to somehow view sidekiq's completed job list, for example, to find all PurchaseWorkers with params (1)? Yesterday in my app a delayed method that was supposed to run didn't, and the associated entity (let's say a 'purchase') got stuck in limbo with state "processing". I am trying to understand the reason: was the job not enqueued at all, or was it enqueued but exited unexpectedly for some reason? There were no errors in the sidekiq log.
Thanks.
This is old, but I wanted to see the same thing, since I'm not sure whether jobs I scheduled ran or not!
It turns out Sidekiq doesn't have anything built in to see jobs that completed, and it still doesn't seem to.
If a job errored and never completed, it should be in the 'dead' queue. But checking that something actually ran seems to be beyond Sidekiq by default.
The FAQ suggests installing third-party plugins to track and log information: https://github.com/mperham/sidekiq/wiki/FAQ#how-can-i-tell-when-a-job-has-finished One of them allows for a callback to do follow-up work (maybe add a record for completed jobs elsewhere?).
You can also set up Sidekiq to log somewhere other than STDOUT (the default) so you can output log information about your jobs: in this case, logging that a job is complete, or catching errors if for some reason it never lands in the retry or dead queues when there is a problem. See https://github.com/mperham/sidekiq/wiki/Logging
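As a sketch of that logging approach, a hypothetical Sidekiq server middleware can record every job that finishes without raising (CompletionLogger is an invented name; jobs that raise never reach the post-yield line, since Sidekiq's retry machinery takes over):

    # config/initializers/sidekiq.rb
    class CompletionLogger
      def call(worker, job, queue)
        yield  # run the actual job
        # only reached when the job did not raise
        Sidekiq.logger.info("completed #{job['class']} args=#{job['args'].inspect} queue=#{queue}")
      end
    end

    Sidekiq.configure_server do |config|
      config.server_middleware do |chain|
        chain.add CompletionLogger
      end
    end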
To see jobs still in a queue, you can use the Rails console and look at the queue by queue name: https://www.rubydoc.info/gems/sidekiq/Sidekiq/Queue
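For example, from a Rails console (assuming the default queue name and the PurchaseWorker from the question):

    require 'sidekiq/api'

    # jobs still sitting in a queue
    Sidekiq::Queue.new('default').select do |job|
      job.klass == 'PurchaseWorker' && job.args == [1]
    end

    # jobs scheduled to run in the future
    Sidekiq::ScheduledSet.new.select do |job|
      job.klass == 'PurchaseWorker' && job.args == [1]
    end

    # jobs awaiting retry, and jobs that exhausted their retries
    Sidekiq::RetrySet.new.size
    Sidekiq::DeadSet.new.size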
One option is the default stats provided by sidekiq - https://github.com/mperham/sidekiq/wiki/Monitoring#using-the-built-in-dashboard
The best option is to use the Web UI provided here - https://github.com/mperham/sidekiq/wiki/Monitoring#web-ui
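Roughly, the stats API and the Web UI mount look like this (the /sidekiq path is just a common choice; note the counters are cumulative totals, not a per-job log):

    require 'sidekiq/api'
    stats = Sidekiq::Stats.new
    stats.processed  # total number of jobs processed
    stats.failed     # total number of failures
    stats.queues     # current queue sizes, e.g. { "default" => 5 }

    # config/routes.rb, mounting the Web UI in a Rails app
    require 'sidekiq/web'
    Rails.application.routes.draw do
      mount Sidekiq::Web => '/sidekiq'
    end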

Spring Batch step execution alert

We use Spring Batch for some long-running maintenance jobs. Very occasionally a job may get stuck on database/network hiccups. Is there a way an email can be sent out on those occasions? Say, if any one step takes more than 2 hours to finish, a group of people gets an email alert.
To send the mail, you can use this class from Spring: org.springframework.mail.javamail.JavaMailSenderImpl,
and to check your condition, you can just implement a check inside an org.springframework.batch.core.StepListener.
This is if you want to receive a mail AFTER a step has finished and took more than 2 hours.
To receive a mail while the step is still running, and has passed the 2 hours limit, it's harder and would require some multithreading development or some external job able to monitor your main job (through org.springframework.batch.core.explore.JobExplorer).
Thanks for the replies. We ended up using other monitoring tools to systematically monitor the batch job's performance and the other systems' performance.

Is there a hard limit on how long Azure role startup can take?

Suppose I include a rather long-running startup task in my Azure role, running for something like up to several minutes. What happens if the startup task runs "too long"?
I'm currently testing on Compute Emulator and observe the following.
I have a 450-megabyte .zip file together with Info-ZIP's unzip. The startup task unzips the archive. Deployment starts and I look in Task Manager. Numerous service processes start, then unzip.exe runs. After about two minutes all those processes stop and then start anew, and unzip.exe starts again.
So it looks like a deployment is allowed to run for about two minutes, then is forcefully reset and started again.
Is this the expected behavior? Does it persist in the real cloud? Are there any hard limits on how long a role startup can take? How do I address this situation other than moving the unpacking into RoleEntryPoint.OnStart()?
I had the same question, so I tried an experiment. I ran a startup task with taskType="simple", so that it would block the Roles from beginning to execute, and let it run for 50 hours. The Fabric Controller did not complain and the portal did not show any error. The task finished its long "do nothing" loop after the 50 hours were up and exited, and my Web Role started up fine.
So my empirical test says startup tasks can take a long time! At least 50 hours.
This should inform the load balancer that your process is still busy:
http://msdn.microsoft.com/en-us/library/microsoft.windowsazure.serviceruntime.roleinstancestatuscheckeventargs.setbusy.aspx
I have run startup tasks that run for a pretty long time (think 20-30 mins), and the role is simply in a 'Busy' state. I don't think there is a hard limit on how long the role will stay in that state as long as the startup task is still executing and did not exit with a non-zero return code (in fact, this is a gotcha for most first-time startup task creators when they pop a prompt). The FC is technically still running just fine, so there would be no reason to 'recover' the role (i.e. heartbeats are still going).
The dev emulator just notices when the role hasn't started and warns you. If you click the 'keep waiting' option, it will continue to run the startup task to completion. The cloud, of course, does not warn you.
I've never tried a task that ran super long, so there might be a very long limit. I seem to recall 3 hrs was a magic number in some timeout cases like role recycles, but I have never tried...
There are some heartbeats that the Azure Fabric Agent will do against the role. If these are not acknowledged (say a long-running blocking process), this could cause the role to be flagged as unavailable.
You might try putting your startup process into a background thread that runs independently. This should help keep the role from being recycled while the process is starting up. Just keep in mind you may need to make some adjustments if you get requests before the role has fully started. There's also a way (that I can't seem to recall at the moment) to flag the role and take it out of the load balancer temporarily while your process completes.
