Heroku clock process: how to ensure jobs weren't skipped?

I'm building a Heroku app that relies on scheduled jobs. We were previously using Heroku Scheduler but clock processes seem more flexible and robust. So now we're using a clock process to enqueue background jobs at specific times/intervals.
Heroku's docs mention that clock dynos, like all dynos, are restarted at least once per day, which incurs the risk of a clock process skipping a scheduled job: "Since dynos are restarted at least once a day some logic will need to exist on startup of the clock process to ensure that a job interval wasn’t skipped during the dyno restart." (See https://devcenter.heroku.com/articles/scheduled-jobs-custom-clock-processes)
What are some recommended ways to ensure that scheduled jobs aren't skipped, and to re-enqueue any jobs that were missed?
One possible way is to create a database record whenever a job is run/enqueued, and to check for the presence of expected records at regular intervals within the clock job. The biggest downside to this is that if there's a systemic problem with the clock dyno that causes it to be down for a significant period of time, then I can't do the polling every X hours to ensure that scheduled jobs were successfully run, since that polling happens within the clock dyno.
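For what it's worth, here's a minimal sketch of that idea, assuming a clockwork-based clock process and a hypothetical JobRun ActiveRecord model with job_name and ran_at columns (all job names and intervals below are illustrative):
require "clockwork"
require_relative "./config/environment" # boot the Rails app so models and jobs are available

SCHEDULES = {
  "HourlyReportJob"   => 1.hour,
  "NightlyCleanupJob" => 24.hours
}

def enqueue_and_record(job_name)
  job_name.constantize.perform_later
  JobRun.create!(job_name: job_name, ran_at: Time.current)
end

# On clock startup, re-enqueue anything whose last recorded run is older
# than its interval, i.e. a run that was skipped during a dyno restart.
SCHEDULES.each do |job_name, interval|
  last_run = JobRun.where(job_name: job_name).maximum(:ran_at)
  enqueue_and_record(job_name) if last_run.nil? || last_run < interval.ago
end

module Clockwork
  every(1.hour,   "hourly_report")   { enqueue_and_record("HourlyReportJob") }
  every(24.hours, "nightly_cleanup") { enqueue_and_record("NightlyCleanupJob") }
end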
How have you dealt with the issue of clock dyno resiliency?
Thanks!

You will need to store data about jobs somewhere. On Heroku, you have no guarantee that your code is running exactly once, or running at all times (because of dyno cycling).
You could use a project like this one (though it isn't widely used): https://github.com/amitree/delayed_job_recurring
Or, depending on your needs, you could create a scheduler process that schedules jobs for the next 24 hours and is run every 4 hours, to be sure your jobs get scheduled, and hope that Heroku Scheduler runs at least once every 24 hours.
And have at least 2 workers processing the jobs.
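As a rough illustration of the 24-hours-ahead idea, here's a sketch of a task that Heroku Scheduler could run every 4 hours to idempotently schedule the next 24 hourly runs with Delayed::Job (the job class and queue name are made up):
# Hypothetical: HourlyReportJob is a plain Delayed Job payload object
# (an object responding to #perform).
(1..24).each do |h|
  run_at = Time.current.beginning_of_hour + h.hours

  # Skip slots that are already scheduled, so re-running this task is harmless.
  next if Delayed::Job.where(queue: "reports", run_at: run_at).exists?

  Delayed::Job.enqueue(HourlyReportJob.new, run_at: run_at, queue: "reports")
end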

Though it requires human involvement, we have our scheduled jobs check in with Honeybadger via an after_perform hook in Rails:
# frozen_string_literal: true

class ScheduledJob < ApplicationJob
  after_perform do |job|
    check_in(job)
  end

  private

  def check_in(job)
    token = Rails.application.config_for(:check_ins)[job.class.name.underscore]
    Honeybadger.check_in(token) if token.present?
  end
end
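For completeness, config_for(:check_ins) above would read something like this hypothetical config/check_ins.yml, mapping underscored job class names to Honeybadger check-in tokens (the tokens here are made up):
production:
  nightly_cleanup_job: "hbcheckin-xxxx"
  weekly_digest_job: "hbcheckin-yyyy"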
This way, when we happen to have poorly timed restarts from deploys, we at least know that should-be-scheduled work didn't actually happen.
Would be interested to know if someone has a more fully-baked, simple solution!

Related

How can I associate a queued job with a particular user in Laravel?

I've built a system based on Laravel where users are able to begin a "task" which repeats a number of times, with a delay between each repetition. I've accomplished this by queueing a job with an amount argument, which then recursively queues an additional job until the count is up.
For example, I start my task with 3 repetitions:
A job is queued with an amount argument of 3. It is run, the amount is decremented to 2, and the same job is queued again with a delay of 5 seconds.
When the job runs again, the process repeats with an amount of 1.
The last job executes, and now that the amount has reached 0, it is not queued again and the tasks have been completed.
This is working as expected, but I need to know whether a user currently has any tasks being processed. I need to be able to do the following:
Check if a particular queue has any jobs started by a particular user.
Check the value that was set for amount on that job.
I'm using the database driver for a queue named tasks. Is there any existing method to accomplish my goals here?
Thanks!
You shouldn't be using delay to queue multiple repetitions of the same job over and over. That functionality is meant for something like retrying a failed network request. Keeping jobs in the queue for multiple hours at a time can lead to memory issues with your queues if the count gets too high.
I would suggest you use the php artisan schedule:run functionality to run a command every 1-5 minutes that checks the database to see whether it is time to run a user's job. If so, kick off that job and add a status flag to the user table (or whatever table you want to use to keep track of these things). When the job finishes, mark that same row as completed and wait for the next time the cron runs to do it again.

SOS-Berlin JobScheduler process queue logic

We're running into an issue with the SOS-Berlin JobScheduler running on Windows that is difficult to diagnose* and I would appreciate any guidance.
*Difficult because I don't know Scala (though I do know C++ and Java), and it's difficult to navigate this code base (some of it is in German).
We have a process-class called Foo, that will sometimes burst up outside the limit of how many processes can be run. So, for example, we limit the process-class to 30 processes and 60 want to run. This leaves 30 running and 30 "waiting for process."
The problem is that JobScheduler doesn't seem to prioritize the 30 that are waiting for a process. Instead, any new job that gets fired after the burst receives processes, leaving some jobs waiting indefinitely. Once the number of jobs "waiting for process" hits zero, the jobs clear out immediately.
Further, it seems that when there are a large number of jobs "waiting for process," the run time for tasks doubles or triples. A job that normally takes 20 seconds to run, will spike to 1-2 minutes, further amplifying the issue as processes are not released back to the pool.
Admittedly, we're running an older version of JobScheduler, which we're planning to upgrade this week or next. However, I'm wondering if there is something fundamental we're missing. We've turned down the logging, looked for DB locks, added memory to the heap, and shut down some other processes on the server. We've also increased the process pool, but we don't want to push it too far, lest we crush the server. Nothing seems to be alleviating the issue.
Any tuning help would be appreciated!
As a follow-up, we determined the cause of the issue.
Another user had been using the temp directory to store intermediate generated files and was not clearing them out, resulting in hundreds of thousands of files in the directory. They were not very large, so we didn't notice. For some reason, JobScheduler started to choke on this; I'm not clear on the exact mechanism.
Clearing the temp directory, scolding the user, and fixing his script fixed the issue.

beanstalkd allowing one job to be reserved twice

I have a beanstalkd instance with two workers picking jobs from one tube.
I've noticed that occasionally one of the workers will reserve a job that has already been reserved (and is being worked on) by the other worker.
I know there aren't duplicate jobs in the queue.
Why does beanstalkd allow the same job to be reserved twice?
It sounds to me like you didn't implement the protocol properly. You need to handle DEADLINE_SOON and use TOUCH.
What does DEADLINE_SOON mean?
DEADLINE_SOON is a response to a reserve command indicating that you have a job reserved whose deadline is real soon (current safety margin is approximately 1 second).
If you are frequently receiving DEADLINE_SOON errors on reserve, you should probably consider increasing the TTR on your jobs as it generally indicates you aren’t completing them in time. It may also be that you are failing to delete tasks when you have completed them.
See the mailing list discussion for more information.
How does TTR work?
TTR only applies to a job at the moment it becomes reserved. At that event, a timer (called “time-left” in the job stats) starts counting down from the job’s TTR.
If the timer reaches zero, the job gets put back in the ready queue.
If the job is buried, deleted, or released before the timer runs out, the timer ceases to exist.
If the job is touched before the timer reaches zero, the timer starts over, counting down from TTR.
The "touch" command
Allows a worker to request more time to work on a job. This is useful for jobs that potentially take a long time, but you still want the benefits of a TTR pulling a job away from an unresponsive worker. A worker may periodically tell the server that it's still alive and processing a job (e.g. it may do this on DEADLINE_SOON). The command postpones the auto release of a reserved job until TTR seconds from when the command is issued.
The jobs take longer to run than the TTR, so they were being returned to the queue and picked up by the other worker.
I now set a larger TTR on the job.
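For anyone hitting the same thing, here's a minimal sketch of both fixes (a larger TTR plus touch), using the beaneater Ruby client; the tube name and processing method are illustrative:
require "beaneater"

beanstalk = Beaneater.new("localhost:11300")
tube = beanstalk.tubes["imports"]

# Producer: set a TTR comfortably above the worst-case processing time.
tube.put("song:42", ttr: 600)

# Worker: reserve, touch while the work is in progress so the job isn't
# handed back to the ready queue, then delete once finished.
job = tube.reserve
job.touch               # resets the TTR countdown; call periodically during long work
process_song(job.body)  # hypothetical processing method
job.delete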

How can I make resque worker process other jobs while current job is sleeping?

Each task I have works in short bursts, then sleeps for about an hour, then works again, and so on until the job is done. Some jobs may take about 10 hours to complete and there is nothing I can do about it.
What bothers me is that while a job is sleeping, the Resque worker stays busy, so if I have 4 workers and 5 jobs, the last job has to wait up to 10 hours before it can be processed. That's grossly suboptimal, since it could run while any other worker is sleeping. Is there any way to make a Resque worker process other jobs while the current job is sleeping?
Currently I have a worker similar to this:
class ImportSongs
  def self.perform(api_token, songs)
    api = API.new api_token

    songs.each_with_index do |song, i|
      # make current worker proceed with another job while it's sleeping
      sleep 60 * 60 if i != 0 && i % 100 == 0
      api.import_song song
    end
  end
end
It looks like the problem you're trying to solve is API rate limiting with batch processing of the import process.
You should have one job that runs as soon as it's enqueued to enumerate all the songs to be imported. You can then break those down into groups of 100 (or whatever size you have to limit it to) and schedule a deferred job using resque-scheduler in one hour intervals.
However, if you have a hard API rate limit and you execute several of these distributed imports concurrently, you may not be able to control how much API traffic is going at once. If your rate limit is that strict, you may want to build a specialized process as a single point of control to enforce the rate limiting, with its own work queue.
With resque-scheduler, you'll be able to repeat discrete jobs at scheduled or delayed times as an alternative to a single, long running job that loops with sleep statements.
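A rough sketch of that layout with resque-scheduler, splitting the songs into batches of 100 and spacing the batches an hour apart (class and queue names are illustrative):
require "resque"
require "resque-scheduler"

class ImportSongBatch
  @queue = :song_imports

  def self.perform(api_token, songs)
    api = API.new api_token
    songs.each { |song| api.import_song song }
  end
end

class ScheduleSongImport
  @queue = :song_imports

  # Enqueued once; fans the import out into hourly batches of 100.
  def self.perform(api_token, songs)
    songs.each_slice(100).with_index do |batch, i|
      Resque.enqueue_in(i * 3600, ImportSongBatch, api_token, batch)
    end
  end
end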

how to implement custom cloud worker

I am designing a cloud app and need a worker process which scours my database looking for work, and then performs it.
Most of the info I seem to find on the subject of background tasks in the cloud involves some kind of scheduler and/or queuing system.
What I have doesn't quite fit into the "run this task every 5 minutes" or "add this to the queue to be executed later" models. I think the main difference to my problem is that the workers themselves find work to do, rather than being assigned it by a periodic scheduler or an external process that generates work.
What I have is basically a giant table where each entry has three fields:
job: a small task to be performed; let's say it gets the latest message from a Twitter account and stores it in the database
the interval at which to perform that job: say, every 5 minutes (N.B. the interval is arbitrary and different for each entry in the table)
the last date when the job was performed
The way I would implement this is to have a worker with an infinite loop. When it enters the loop, it scours the database (a) looking for items whose date + interval < currentTime; (b) when it finds one, it sets date = currentTime; and (c) it then executes the job. If there is no work at the moment, it sleeps for a few seconds, then tries again.
I will have many parallel workers scouring the database simultaneously, which is why I do (b) before (c) in the paragraph above. Since there are parallel workers, actions (a) and (b) are atomic operations on the database, to prevent work from being duplicated. If a worker crashes after (a) and (b) but before it manages to finish the work, it's no big deal; the workers can just do it at the next interval. The reason is that the work is not performed in a time-invariant system, so a backlog of failed jobs has no benefit: the tasks have to be performed at their exact intervals, and it's better to skip one interval than to have uneven intervals between executions.
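To make the claim concrete, here is a sketch of steps (a)-(c), assuming Postgres and a hypothetical ActiveRecord Task model with last_run_at (timestamp) and interval_seconds (integer) columns:
loop do
  due = Task.where("last_run_at + interval_seconds * interval '1 second' <= ?", Time.current)

  if due.empty?
    sleep 5
    next
  end

  due.find_each do |task|
    # Atomic claim: only one worker's UPDATE matches the old last_run_at,
    # so concurrent workers can't pick up the same row.
    claimed = Task.where(id: task.id, last_run_at: task.last_run_at)
                  .update_all(last_run_at: Time.current)
    next if claimed.zero?   # another worker got there first

    task.run_job            # hypothetical method that performs the stored job
  end
end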
My question is whether that is a reasonable implementation strategy? If so, how do I bring this process to life on the cloud (I am using Heroku, but may switch to EC2 in the future)? I still haven't written any code so I would welcome other suggestions (maybe I misunderstood the use cases/applications for queue systems).
This sounds so close to using something like a scheduled job that you might as well tread the well beaten path and do it the more conventional way. There's no reason why you can't schedule a job to run once every few seconds.
However, this idea of looking for work sounds dodgy. What happens if two workers find the same task to run at the same time for instance? Also, are there not triggers in the application which can indicate that work needs doing? It seems strange that you have code 'looking for work'.
You can go a very long way with simple periodic background tasks, so I would exhaust all possibilities in that area before rolling your own.
