How to handle SIGTERM with resque-status in complex jobs - heroku

I've been using resque on Heroku, which will from time to time interrupt your jobs with a SIGTERM.
Thus far I've handled this with a simple:
def process(options)
do_the_job
rescue Resque::TermException
self.defer options
end
We've started using resque-status so that we can keep track of jobs, but the method above obviously breaks that as the job will show completed when actually it's been deferred to another job.
My current thinking is that instead of deferring the current job in resque, there needs to be another job that re-queues jobs that have failed due to SIGTERM.
The trick comes in that some jobs are more complicated:
def process(options)
do_part1 unless options['part1_finished']
options['part1_finished']
do_part2
rescue Resque::TermException
self.defer options
end
Simply removing the rescue and simply retrying those jobs would cause an exception when do_part1 gets repeated.

Looking more deeply into how resque-status works, a possible work around is to go straight to resque for the re-queue using the same parameters that resque-status would use.
def process
do_part1 unless options['part1_finished']
options['part1_finished']
do_part2
rescue Resque::TermException
Resque.enqueue self.class, uuid, options
raise DeferredToNewJob
end
Of course, this is undocumented so may be incompatible with future releases of resque-status.
There is a draw back: between that job failing and the new job picking it up, the status of the first job will be reported by resque-status.
This is why I re-raise a new exception - otherwise the job status will show completed until the new worker picks up the old job, which may confuse processes that are watching and waiting for the job to finish.
By raising a new exception DeferredToNewJob, the job status will temporarily show failure, which is easier to work around at the front end, and the specific exception can be automatically cleared from the resque failure queue.
UPDATE
resque-status provides support for on_failure handler. If a method with this name is defined as an instance method on the class, we can make this even simpler
Here's my on_failure
def on_failure(e)
if e.is_a? DeferredToNewJob
tick('Waiting for new job')
else
raise e
end
end
With this in place the job spends basically no time in the failed state for processes watching it's status.
In addition, if resque-status finds this handler, then it won't raise the exception up to resque, so it won't get added to the failed queue.

Related

Python3 How to gracefully shutdown a multiprocess application

I am trying to fix a python3 application where multiple proceess and threads are created controlled by various queues and pipes. I am trying to make a form of controlled exit when someone tries to break the program with ctrl-c. However no mather what I do it always hangs just at the end.
I've tried to used Keyboard-interrupt exception and signal catch
The below code is part of the multi process code.
from multiprocessing import Process, Pipe, JoinableQueue as Queue, Event
class TaskExecutor(Process):
def __init__(....)
{inits}
def signal_handler(self, sig, frame):
print('TaskExecutor closing')
self._in_p.close()
sys.exit(1)
def run
signal.signal(signal.SIGINT, self.signal_handler)
signal.signal(signal.SIGTERM, self.signal_handler)
while True:
# Get the Task Groupe name from the Task queue.
try:
ExecCmd = self._in_p.recv() # type: TaskExecCmd
except Exceptions as e:
self._in_p.close()
return
if ExecCmd.Kill:
self._log.info('{:30} : Kill Command received'.format(self.name))
self._in_p.close()
return
else
{other code executing here}
I'm getting the above print that its closing.
but im still getting a lot of different exceptions which i try to catch but it will not.
I'm am looking for some documentation on how to and in which order to shut down multiprocess and its main process.
I know it's very general question however its a very large application so if there are any question or thing i could test i could narrow it down.
Regards
So after investigating this issue further I found that in situation where I had a pipe thread, Queue thread and 4 multiprocesses running. # of these processes could end up hanging when terminating the application with ctrl-c. The Pipe and Queue process where already shut down.
In the multiprocessing documentation there are a warning.
Warning If this method is used when the associated process is using a
pipe or queue then the pipe or queue is liable to become corrupted and
may become unusable by other process. Similarly, if the process has
acquired a lock or semaphore etc. then terminating it is liable to
cause other processes to deadlock.
And I think this is what's happening.
I also found that even though I have a shutdown mechanism in my multi-process class the threads still running would of cause be considered alive (reading is_alive()) even though I know that the run() method have return IE som internal was hanging.
Now of the solution. My multiprocesses was for a design view not a Deamon because I wanted to control the shot down of them. However I changed them to Deamon so they would always be killed regardless. I first added that anyone kill signal would raise and ProgramKilled exception throughout my entire program.
def signal_handler(signum, frame):
raise ProgramKilled('Task Executor killed')
I then changed my shut down mechanism in my multi process class to
while True:
# Get the Task Groupe name from the Task queue.
try:
# Reading from pipe
ExecCmd = self._in_p.recv() # type: TaskExecCmd
# If fatal error just close it all
except BrokenPipe:
break
# This can occure close the pipe and break the loop
except EOFError:
self._in_p.close()
break
# Exception for when a kill signal is detected
# Set the multiprocess as killed (just waiting for the kill command from main)
except ProgramKilled:
self._log.info('{:30} : Died'.format(self.name))
self._KilledStatus = True
continue
# kill command from main recieved
# Shut down all we can. Ignore exceptions
if ExecCmd.Kill:
self._log.info('{:30} : Kill Command received'.format(self.name))
try:
self._in_p.close()
self._out_p.join()
except Exception:
pass
self._log.info('{:30} : Kill Command executed'.format(self.name))
break
else if (not self._KilledStatus):
{Execute code}
# When out of the loop set killed event
KilledEvent.set()
And in my main thread I have added the following clean up process.
#loop though all my resources
for ThreadInterfaces in ResourceThreadDict.values():
# test each process in each resource
for ThreadIf in ThreadInterfaces:
# Wait for its event to be set
ThreadIf['KillEvent'].wait()
# When event have been recevied see if its hanging
# We know at this point every thing have been closed and all data have been purged correctly so if its still alive terminate it.
if ThreadIf['Thread'].is_alive():
try:
psutil.Process(ThreadIf['Thread'].pid).terminate()
except (psutil.NoSuchProcess, AttributeError):
pass
Af a lot of testing I know its really hard to control a termination of and app with multiple processes because you simply do not know in which order all of your processes receive this signal.
I've tried to in someway to save most of my data when its killed. Some would argue what I need that data for when manually terminating the app. But in this case this app runs a lot of external scripts and other application and any of those can lock the application and then you need to manually kill it but still retain the information for what have already been executed.
So this is my solution to my current problem with my current knowledge.
Any input or more in depth knowledge on what happening is welcome.
Please note that this app runs both on linux and windows.
Regards

Get sidekiq to execute a job immediately

At the moment, I have a sidekiq job like this:
class SyncUser
include Sidekiq::Worker
def perform(user_id)
#do stuff
end
end
I am placing a job on the queue like this:
SyncUser.perform_async user.id
This all works of course but there is a bit of a lag between calling perform_async and the job actually getting executed.
Is there anything else I can do to tell sidekiq to execute the job immediately?
There are two questions here.
If you want to execute a job immediately, in the current context you can use:
SyncUser.new.perform(user.id)
If you want to decrease the delay between asynchronous work being scheduled and when it's executed in the sidekiq worker, you can decrease the poll_interval setting:
Sidekiq.configure_server do |config|
config.poll_interval = 2
end
The poll_interval is the delay within worker backends of how frequently workers check for jobs on the queue. The average time between a job being scheduled and executed with a free worker will be poll_interval / 2.
use .perform_inline method
SyncUser.perform_inline(user.id)
If you also need to perform nested jobs, you can use Sidekiq::Testing.inline! in your production console
require 'sidekiq/testing'
Sidekiq::Testing.inline!
SyncUser.perform_inline(user.id)
For those who are using Sidekiq via the Active Job framework, you can do
SyncUser.perform_now(user.id)

Recovering cleanly from Resque::TermException or SIGTERM on Heroku

When we restart or deploy we get a number of Resque jobs in the failed queue with either Resque::TermException (SIGTERM) or Resque::DirtyExit.
We're using the new TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 in our Procfile so our worker line looks like:
worker: TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 bundle exec rake environment resque:work QUEUE=critical,high,low
We're also using resque-retry which I thought might auto-retry on these two exceptions? But it seems to not be.
So I guess two questions:
We could manually rescue from Resque::TermException in each job, and use this to reschedule the job. But is there a clean way to do this for all jobs? Even a monkey patch.
Shouldn't resque-retry auto retry these? Can you think of any reason why it wouldn't be?
Thanks!
Edit: Getting all jobs to complete in less than 10 seconds seems unreasonable at scale. It seems like there needs to be a way to automatically re-queue these jobs when the Resque::DirtyExit exception is run.
I ran into this issue as well. It turns out that Heroku sends the SIGTERM signal to not just the parent process but all forked processes. This is not the logic that Resque expects which causes the RESQUE_PRE_SHUTDOWN_TIMEOUT to be skipped, forcing jobs to executed without any time to attempt to finish a job.
Heroku gives workers 30s to gracefully shutdown after a SIGTERM is issued. In most cases, this is plenty of time to finish a job with some buffer time left over to requeue the job to Resque if the job couldn't finish. However, for all of this time to be used you need to set the RESQUE_PRE_SHUTDOWN_TIMEOUT and RESQUE_TERM_TIMEOUT env vars as well as patch Resque to correctly respond to SIGTERM being sent to forked processes.
Here's a gem which patches resque and explains this issue in more detail:
https://github.com/iloveitaly/resque-heroku-signals
This will be a two part answer, first addressing Resque::TermException and then Resque::DirtyExit.
TermException
It's worth noting that if you are using ActiveJob with Rails 7 or later the retry_on and discard_on methods can be used to handle Resque::TermException. You could write the following in your job class:
retry_on(::Resque::TermException, wait: 2.minutes, attempts: 4)
or
discard_on(::Resque::TermException)
A big caveat here is that if you are using a Rails version prior to 7 you'll need to add some custom code to get this to work.
The reason is that Resque::TermException does not inherit from StandardError (it inherits from SignalException, source: https://github.com/resque/resque/blob/master/lib/resque/errors.rb#L26) and prior to Rails 7 retry_on and discard_on only handle exceptions that inherit from StandardError.
Here's the Rails 7 commit that changes this to work with all exception subclasses: https://github.com/rails/rails/commit/142ae54e54ac81a0f62eaa43c3c280307cf2127a
So if you want to use retry_on to handle Resque::TermException on a Rails version earlier than 7 you have a few options:
Monkey patch TermException so that it inherits from StandardError.
Add a rescue statement to your perform method that explicitly looks for Resque::TermException or one of its ancestors (eg SignalException, Exception).
Patch the implementation of perform_now with the Rails 7 version (this is what I did in my codebase).
Here's how you can retry on a TermException by adding a rescue to your job's perform method:
class MyJob < ActiveJob::Base
prepend RetryOnTermination
# ActiveJob's `retry_on` and `discard_on` methods don't handle
`TermException`
# because it inherits from `SignalException` rather than `StandardError`.
module RetryOnTermination
def perform(*args, **kwargs)
super
rescue Resque::TermException
Rails.logger.info("Retrying #{self.class.name} due to Resque::TermException")
self.class.set(wait: 2.minutes).perform_later(*args, **kwargs)
end
end
end
Alternatively you can use the Rails 7 definition of perform_now by adding this to your job class:
# FIXME: Here we override the Rails 6 implementation of this method with the
# Rails 7 implementation in order to be able to retry/discard exceptions that
# don't inherit from StandardError, such as `Resque::TermException`.
#
# When we upgrade to Rails 7 we should remove this.
# Latest stable Rails (7 as of this writing) source: https://github.com/rails/rails/blob/main/activejob/lib/active_job/execution.rb
# Rails 6.1 source: https://github.com/rails/rails/blob/6-1-stable/activejob/lib/active_job/execution.rb
# Rails 6.0 source (same code as 6.1): https://github.com/rails/rails/blob/6-0-stable/activejob/lib/active_job/execution.rb
#
# NOTE: I've made a minor change to the Rails 7 implementation, I've removed
# the line `ActiveSupport::ExecutionContext[:job] = self`, because `ExecutionContext`
# isn't defined prior to Rails 7.
def perform_now
# Guard against jobs that were persisted before we started counting executions by zeroing out nil counters
self.executions = (executions || 0) + 1
deserialize_arguments_if_needed
run_callbacks :perform do
perform(*arguments)
end
rescue Exception => exception
rescue_with_handler(exception) || raise
end
DirtyExit
Resque::DirtyExit is raised in the parent process, rather than the forked child process that actually executes your job code. This means that any code you have in your job for rescuing or retrying those exceptions won't work. See these lines of code where that happens:
https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L940
https://github.com/resque/resque/blob/master/lib/resque/job.rb#L234
https://github.com/resque/resque/blob/master/lib/resque/job.rb#L285
But fortunately, Resque provides a mechanism for dealing with this, job hooks, specifically the on_failure hook: https://github.com/resque/resque/blob/master/docs/HOOKS.md#job-hooks
A quote from those docs:
on_failure: Called with the exception and job args if any exception occurs while performing the job (or hooks), this includes Resque::DirtyExit.
And an example from those docs on how to use hooks to retry exceptions:
module RetriedJob
def on_failure_retry(e, *args)
Logger.info "Performing #{self} caused an exception (#{e}). Retrying..."
Resque.enqueue self, *args
end
end
class MyJob
extend RetriedJob
end
We could manually rescue from Resque::TermException in each job, and use this to reschedule the job. But is there a clean way to do
this for all jobs? Even a monkey patch.
The Resque::DirtyExit exception is raised when the job is killed with the SIGTERM signal. The job does not have the opportunity to catch the exception as you can read here.
Shouldn't resque-retry auto retry these? Can you think of any reason why it wouldn't be?
Don't see why it shouldn't, is the scheduler running? If not rake resque:scheduler.
I wrote a detailed blog post around some of the problems I had recently with Resque::DirtyExit, maybe it is useful => Understanding the Resque internals – Resque::DirtyExit unveiled
I've also struggled with this for awhile without finding a reliable solution.
One of the few solutions I've found is running a rake task on a schedule (cron job every 1 minute) which looks for jobs failing with Resque::DirtyExit, retries these specific jobs and removes these jobs from the failure queue.
Here's a sample of the rake task
https://gist.github.com/CharlesP/1818418754aec03403b3
This solution is clearly suboptimal but to date it's the best solution I've found to retry these jobs.
Are your resque jobs taking longer than 10 seconds to complete? If the jobs complete within 10 seconds after the initial SIGTERM is sent you should be fine. Try to break up the jobs into smaller chunks that finish quicker.
Also, you can have your worker re-enqueue the job doing something like this: https://gist.github.com/mrrooijen/3719427

Delayed Job creating Airbrakes every time it raises an error

def perform
refund_log = {
success: refund_retry.success?,
amount: refund_amount,
action: "refund"
}
if refund_retry.success?
refund_log[:reference] = refund_retry.transaction.id
refund_log[:message] = refund_retry.transaction.status
else
refund_log[:message] = refund_retry.message
refund_log[:params] = {}
refund_retry.errors.each do |error|
refund_log[:params][error.code] = error.message
end
order_transaction.message = refund_log[:params].values.join('|')
raise "delayed RefundJob has failed"
end
end
When I raise "delayed RefundJob has failed" in the else statement, it creates an Airbrake. I want to run the job again if it ends up in the else section.
Is there any way to re-queue the job without raising an exception? And prevent creating an airbrake?
I am using delayed_job version 1.
The cleanest way would be to re-queue, i.e. create a new job and enqueue it, and then exit the method normally.
To elaborate on #Roman's response, you can create a new job, with a retry parameter in it, and enqueue it.
If you maintain the retry parameter (increment it each time you re-enqueue a job), you can track how many retries you made, and thus avoid an endless retry loop.
DelayedJob expects a job to raise an error to requeued, by definition.
From there you can either :
Ignore your execpetion on airbrake side, see https://github.com/airbrake/airbrake#filtering so it still gets queued again without filling your logs
Dive into DelayedJob code where you can see on https://github.com/tobi/delayed_job/blob/master/lib/delayed/job.rb#L65 that a method named reschedule is available and used by run_with_lock ( https://github.com/tobi/delayed_job/blob/master/lib/delayed/job.rb#L99 ). From there you can call reschedule it manually, instead of raising your exception.
About the later solution, I advise adding some mechanism that still fill an airbrake report on the third or later try, you can still detect that something is wrong without the hassle of having your logs filled by the attempts.

Ruby - Error Handling - Good Practices

This is more of an opinion oriented question. When handling exceptions in nested codes such as:
Assuming you have a class that initialize another class to run a job. The job returns a value, which is then processed by the class which initially called it.
Where would you put the exception and error logging? Would you define it on the initialization of the job class in the calling class, which will handle then exception in the job execution or on both levels ?
if the job handles exceptions then you don't need to wrap the call to the job in a try catch.
but the class that initializes and runs the job could throw exceptions, so you should handle exceptions at that level as well.
here is an example:
def some_job
begin
# a bunch of logic
rescue
# handle exception
# log it
end
end
it wouldn't make sense then to do this:
def some_manager
begin
some_job
rescue
# log
end
end
but something like this makes more sense:
def some_manager
begin
# a bunch of logic
some_job
# some more logic
rescue
# handle exception
# log
end
end
and of course you would want to catch specific exceptions.
Probably the best answer, in general, for handling Exceptions in Ruby is reading Exceptional Ruby. It may change your perspective on error handling.
Having said that, your specific case. When I hear "job" in hear "background process", so I'll base my answer on that.
Your job will want to report status while it's doing it's thing. This could be states like "in queue", "running", "finished", but it also could be more informative (user facing) information: "processing first 100 out of 1000 records".
So, if an error happens in your background process, my suggestion is two-fold:
Make sure you catch exceptions before you exit the job. Your background job processor might not like a random exception coming from your code. I, personally, like the idea of catching the exception and saving it to the database, for easy retrieval later. Then again, depending on your background job processor, maybe it handles error reporting for you. (I think reque does, for example).
On the front end, use AJAX (or something) to occasionally check in to how the job is doing. Say every 10 seconds or something. In additional to getting the status of the job, also make sure you return this additional information to the user (if appropriate).

Resources