Using RabbitMQ with workers that run multithreaded jobs - Parallel Gem - ruby

I am building a system in ruby (rabbitmq, parallel gem) that takes a list of jobs, queues them up in rabbit and then has workers pop jobs off the queue to execute them.
It is easy to get a worker to pop a single job off of the queue when it is ready but I would like to have each worker run 5 threads such that when all 5 threads are processing, that worker does not pop any jobs off the queue. When a thread becomes free, the worker accepts a job from the queue.
Using the Parallel gem, the only way I see to create multi-threaded processes is with the following code.
results = Parallel.map(array, :in_processes => MAX_PROCESSES) do |item|
  item.process
end
I would like to do something like
while true
  cur_threads = Parallel.get_cur_threads
  if cur_threads < MAX_PROCESSES
    # get another job from queue
    # allocate a thread for the job
  end
end
Any ideas??

I know this is really old but I think you'd probably get what you're looking for using the work_queue gem.
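For reference, the "pop only when a thread is free" behaviour can be sketched with nothing but Ruby's standard library. This is a minimal illustration, not the work_queue gem's API: the in-memory Queue stands in for RabbitMQ, and MAX_THREADS, the :stop marker and the doubling "work" are all placeholders made up for the example.

```ruby
MAX_THREADS = 5

# Stand-in for the RabbitMQ queue; a real worker would consume from the broker.
jobs = Queue.new
20.times { |i| jobs << i }
MAX_THREADS.times { jobs << :stop } # one stop marker per thread

results = Queue.new

# Each thread pops a job only when it is idle, so this worker never
# holds more than MAX_THREADS jobs at once.
threads = Array.new(MAX_THREADS) do
  Thread.new do
    while (job = jobs.pop) != :stop
      sleep 0.01          # placeholder for the real work
      results << job * 2
    end
  end
end

threads.each(&:join)
processed = Array.new(results.size) { results.pop }.sort
puts processed.size
```

Because every thread blocks in jobs.pop while idle, no polling loop or thread counter is needed. With a real broker, a prefetch limit of 5 combined with manual acknowledgements should give similar back-pressure.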

Related

Sidekiq drain also performs new jobs enqueued via perform_in

I want to test a worker that enqueues itself (based on some retry logic).
class SomeWorker
  include Sidekiq::Worker

  def perform
    SomeWorker.perform_in(10.minutes)
  end
end
SomeWorker.perform_async
SomeWorker.drain # => This continuously enqueues and runs the job
...assert something...
It doesn't wait 10 minutes before running it.
I was thinking that the call to drain would run only the first job, and that I would need to call drain again to run more.
What I want my test to look like is
SomeWorker.perform_async
SomeWorker.drain
...assert something is retrying...
SomeWorker.drain
...assert something is not retrying...
How can I test this?
SomeWorker.drain will keep running jobs on the queue until it's empty. If your job queues another one, it will run that as well.
If you only want to run a single job, use SomeWorker.perform_one.
Docs
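To make the difference concrete, here is a toy model of a fake in-memory job queue. It is not Sidekiq's implementation; ToyWorker and its retry limit of three runs are invented for the illustration, and only the method names drain and perform_one mirror the ones mentioned above.

```ruby
# Toy stand-in for a fake job queue used in tests; NOT Sidekiq's code.
class ToyWorker
  @jobs = []
  @runs = 0

  class << self
    attr_reader :jobs, :runs

    def perform_async
      @jobs << :job
    end

    # Run exactly one queued job.
    def perform_one
      raise "no jobs queued" if @jobs.empty?

      @jobs.shift
      @runs += 1
      perform_async if @runs < 3 # the job re-enqueues itself twice
    end

    # Keep running until the queue is empty, including newly enqueued jobs.
    def drain
      perform_one until @jobs.empty?
    end
  end
end

ToyWorker.perform_async
ToyWorker.perform_one
after_one = [ToyWorker.runs, ToyWorker.jobs.size]   # state after a single job
ToyWorker.drain
after_drain = [ToyWorker.runs, ToyWorker.jobs.size] # state once the queue is empty
p after_one, after_drain
```

After perform_one the retry is still sitting in the queue, which is the point where the "is retrying" assertion would go. drain then consumes every remaining job.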

Get sidekiq to execute a job immediately

At the moment, I have a sidekiq job like this:
class SyncUser
  include Sidekiq::Worker

  def perform(user_id)
    # do stuff
  end
end
I am placing a job on the queue like this:
SyncUser.perform_async user.id
This all works of course but there is a bit of a lag between calling perform_async and the job actually getting executed.
Is there anything else I can do to tell sidekiq to execute the job immediately?
There are two questions here.
If you want to execute a job immediately in the current context, you can use:
SyncUser.new.perform(user.id)
If you want to decrease the delay between asynchronous work being scheduled and when it's executed in the sidekiq worker, you can decrease the poll_interval setting:
Sidekiq.configure_server do |config|
  config.poll_interval = 2
end
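The poll_interval setting controls how often Sidekiq polls Redis for scheduled jobs and retries that have come due (newer Sidekiq versions call it average_scheduled_poll_interval). With a single process, a due job waits about poll_interval / 2 on average before it is moved onto its queue. Jobs enqueued with perform_async are not polled for; a free worker thread picks them up right away.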
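The poll_interval / 2 figure can be checked with a quick simulation. This is a rough sketch under a simplifying assumption: a single poller wakes at fixed multiples of the interval, and jobs become due at uniformly random moments. The average_wait helper is made up for the example and is not part of Sidekiq.

```ruby
# Estimate the average wait when a poller wakes every `poll_interval` seconds
# and jobs become due at uniformly random moments (a simplifying assumption).
def average_wait(poll_interval, samples: 100_000, rng: Random.new(42))
  total = 0.0
  samples.times do
    due_at = rng.rand * poll_interval * 1000 # random moment on a long timeline
    next_poll = (due_at / poll_interval).ceil * poll_interval
    total += next_poll - due_at              # time until the poller next wakes
  end
  total / samples
end

puts average_wait(2.0).round(2)
puts average_wait(15.0).round(2)
```

The two printed values should land close to 1 and 7.5 seconds, that is, half of each interval.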
Use the perform_inline method:
SyncUser.perform_inline(user.id)
If you also need to perform nested jobs inline, you can use Sidekiq::Testing.inline! in your production console:
require 'sidekiq/testing'
Sidekiq::Testing.inline!
SyncUser.perform_inline(user.id)
For those who are using Sidekiq via the Active Job framework, you can do
SyncUser.perform_now(user.id)

Any way to snipe or terminate specific sidekiq workers?

Is it possible to snipe or cancel specific Sidekiq workers or running jobs, effectively raising an exception (or something similar) in the worker thread to terminate it?
I have some fairly simple background Ruby (MRI 1.9.3) jobs under Sidekiq (latest) that run fine and are dependent on external systems. The external systems can take varying amounts of time during which the worker must remain available.
I think I can use Sidekiq's API to get to the appropriate worker, but I don't see any 'terminate/cancel/quit/exit' methods in the docs. Is this possible? Is this something other people have done?
P.S. I know I could use an async loop within the worker's job to trap relevant signals and shut itself down, but that would complicate things a bit due to the nature of the external systems.
An async loop is the best way to do it, as Sidekiq has no way to terminate a running job.
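def perform
  # Runs the actual work; always sets the key on exit so the watcher stops.
  main_thread = Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      begin
        # ...
      ensure
        $redis.set some_thread_key, 1
      end
    end
  end

  # Polls Redis; once the key appears (set by the ensure block above, or by
  # whatever code wants to cancel the job), kills the main thread and waits
  # for it to die.
  watcher_thread = Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      until $redis.del(some_thread_key) == 1 do
        sleep 1
      end
      main_thread.kill
      until !!main_thread.status == false do
        sleep 0.1
      end
    end
  end

  [main_thread, watcher_thread].each(&:join)
end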
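If the job can be split into small steps, a cooperative variant avoids Thread#kill altogether: the job checks a cancel flag between steps and exits on its own. The sketch below is a standard-library illustration of that idea, not the code above. The cancel_requested flag is a stand-in for something like the Redis key, and the 100 steps and sleep durations are arbitrary.

```ruby
# Hypothetical cancel flag; in a real deployment this might be a Redis key.
cancel_requested = false
flag_lock = Mutex.new
steps_done = 0

worker = Thread.new do
  100.times do
    break if flag_lock.synchronize { cancel_requested } # check between steps

    sleep 0.01 # placeholder for one unit of real work
    steps_done += 1
  end
end

# Something outside the job asks it to stop.
sleep 0.05
flag_lock.synchronize { cancel_requested = true }
worker.join

puts "stopped after #{steps_done} of 100 steps"
```

The trade-off is responsiveness: the job only notices the request between steps, so a single long blocking call to an external system still has to finish or time out first.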

Recovering cleanly from Resque::TermException or SIGTERM on Heroku

When we restart or deploy we get a number of Resque jobs in the failed queue with either Resque::TermException (SIGTERM) or Resque::DirtyExit.
We're using the new TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 in our Procfile so our worker line looks like:
worker: TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 bundle exec rake environment resque:work QUEUE=critical,high,low
We're also using resque-retry, which I thought might auto-retry on these two exceptions, but it doesn't seem to.
So I guess two questions:
We could manually rescue from Resque::TermException in each job, and use this to reschedule the job. But is there a clean way to do this for all jobs? Even a monkey patch.
Shouldn't resque-retry auto retry these? Can you think of any reason why it wouldn't be?
Thanks!
Edit: Getting all jobs to complete in less than 10 seconds seems unreasonable at scale. It seems like there needs to be a way to automatically re-queue these jobs when the Resque::DirtyExit exception is raised.
I ran into this issue as well. It turns out that Heroku sends the SIGTERM signal not just to the parent process but to all forked processes. This is not the behavior Resque expects, and it causes RESQUE_PRE_SHUTDOWN_TIMEOUT to be skipped, so jobs are terminated without any time to finish.
Heroku gives workers 30s to shut down gracefully after a SIGTERM is issued. In most cases, this is plenty of time to finish a job, with some buffer time left over to requeue the job to Resque if it couldn't finish. However, for all of this time to be used you need to set the RESQUE_PRE_SHUTDOWN_TIMEOUT and RESQUE_TERM_TIMEOUT env vars as well as patch Resque to correctly respond to SIGTERM being sent to forked processes.
Here's a gem which patches resque and explains this issue in more detail:
https://github.com/iloveitaly/resque-heroku-signals
This will be a two part answer, first addressing Resque::TermException and then Resque::DirtyExit.
TermException
It's worth noting that if you are using ActiveJob with Rails 7 or later the retry_on and discard_on methods can be used to handle Resque::TermException. You could write the following in your job class:
retry_on(::Resque::TermException, wait: 2.minutes, attempts: 4)
or
discard_on(::Resque::TermException)
A big caveat here is that if you are using a Rails version prior to 7 you'll need to add some custom code to get this to work.
The reason is that Resque::TermException does not inherit from StandardError (it inherits from SignalException, source: https://github.com/resque/resque/blob/master/lib/resque/errors.rb#L26) and prior to Rails 7 retry_on and discard_on only handle exceptions that inherit from StandardError.
Here's the Rails 7 commit that changes this to work with all exception subclasses: https://github.com/rails/rails/commit/142ae54e54ac81a0f62eaa43c3c280307cf2127a
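The inheritance point can be checked without Resque installed. The sketch below uses a hypothetical FakeTermException that, like Resque::TermException, subclasses SignalException. It shows that a bare rescue (which only covers StandardError) lets the exception through, while an explicit rescue catches it.

```ruby
# Hypothetical stand-in with the same ancestry as Resque::TermException.
class FakeTermException < SignalException; end

def bare_rescue
  raise FakeTermException, "SIGTERM"
rescue => e # same as `rescue StandardError => e`
  "rescued #{e.class}"
end

def explicit_rescue
  raise FakeTermException, "SIGTERM"
rescue SignalException => e
  "rescued #{e.class}"
end

# The bare rescue does not catch the exception, so it escapes to the caller.
escaped = begin
  bare_rescue
  false
rescue SignalException
  true
end

puts FakeTermException.ancestors.include?(StandardError)
puts escaped
puts explicit_rescue
```

Any handler that filters on StandardError misses the exception in the same way, which matches the pre-Rails 7 behaviour of retry_on and discard_on described above.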
So if you want to use retry_on to handle Resque::TermException on a Rails version earlier than 7 you have a few options:
Monkey patch TermException so that it inherits from StandardError.
Add a rescue statement to your perform method that explicitly looks for Resque::TermException or one of its ancestors (eg SignalException, Exception).
Patch the implementation of perform_now with the Rails 7 version (this is what I did in my codebase).
Here's how you can retry on a TermException by adding a rescue to your job's perform method:
class MyJob < ActiveJob::Base
  # ActiveJob's `retry_on` and `discard_on` methods don't handle `TermException`
  # because it inherits from `SignalException` rather than `StandardError`.
  module RetryOnTermination
    def perform(*args, **kwargs)
      super
    rescue Resque::TermException
      Rails.logger.info("Retrying #{self.class.name} due to Resque::TermException")
      self.class.set(wait: 2.minutes).perform_later(*args, **kwargs)
    end
  end

  # Prepend after the module is defined so the constant already exists.
  prepend RetryOnTermination
end
Alternatively you can use the Rails 7 definition of perform_now by adding this to your job class:
# FIXME: Here we override the Rails 6 implementation of this method with the
# Rails 7 implementation in order to be able to retry/discard exceptions that
# don't inherit from StandardError, such as `Resque::TermException`.
#
# When we upgrade to Rails 7 we should remove this.
# Latest stable Rails (7 as of this writing) source: https://github.com/rails/rails/blob/main/activejob/lib/active_job/execution.rb
# Rails 6.1 source: https://github.com/rails/rails/blob/6-1-stable/activejob/lib/active_job/execution.rb
# Rails 6.0 source (same code as 6.1): https://github.com/rails/rails/blob/6-0-stable/activejob/lib/active_job/execution.rb
#
# NOTE: I've made a minor change to the Rails 7 implementation, I've removed
# the line `ActiveSupport::ExecutionContext[:job] = self`, because `ExecutionContext`
# isn't defined prior to Rails 7.
def perform_now
  # Guard against jobs that were persisted before we started counting executions by zeroing out nil counters
  self.executions = (executions || 0) + 1

  deserialize_arguments_if_needed

  run_callbacks :perform do
    perform(*arguments)
  end
rescue Exception => exception
  rescue_with_handler(exception) || raise
end
DirtyExit
Resque::DirtyExit is raised in the parent process, rather than the forked child process that actually executes your job code. This means that any code you have in your job for rescuing or retrying those exceptions won't work. See these lines of code where that happens:
https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L940
https://github.com/resque/resque/blob/master/lib/resque/job.rb#L234
https://github.com/resque/resque/blob/master/lib/resque/job.rb#L285
But fortunately, Resque provides a mechanism for dealing with this, job hooks, specifically the on_failure hook: https://github.com/resque/resque/blob/master/docs/HOOKS.md#job-hooks
A quote from those docs:
on_failure: Called with the exception and job args if any exception occurs while performing the job (or hooks), this includes Resque::DirtyExit.
And an example from those docs on how to use hooks to retry exceptions:
module RetriedJob
  def on_failure_retry(e, *args)
    Logger.info "Performing #{self} caused an exception (#{e}). Retrying..."
    Resque.enqueue self, *args
  end
end

class MyJob
  extend RetriedJob
end
We could manually rescue from Resque::TermException in each job, and use this to reschedule the job. But is there a clean way to do this for all jobs? Even a monkey patch.
The Resque::DirtyExit exception is raised when the job is killed with the SIGTERM signal. The job does not have the opportunity to catch the exception as you can read here.
Shouldn't resque-retry auto retry these? Can you think of any reason why it wouldn't be?
I don't see why it shouldn't. Is the scheduler running? If not, start it with rake resque:scheduler.
I wrote a detailed blog post around some of the problems I had recently with Resque::DirtyExit, maybe it is useful => Understanding the Resque internals – Resque::DirtyExit unveiled
I've also struggled with this for a while without finding a reliable solution.
One of the few solutions I've found is running a rake task on a schedule (a cron job every minute) which looks for jobs failing with Resque::DirtyExit, retries those specific jobs and removes them from the failure queue.
Here's a sample of the rake task:
https://gist.github.com/CharlesP/1818418754aec03403b3
This solution is clearly suboptimal but to date it's the best solution I've found to retry these jobs.
Are your resque jobs taking longer than 10 seconds to complete? If the jobs complete within 10 seconds after the initial SIGTERM is sent you should be fine. Try to break up the jobs into smaller chunks that finish quicker.
Also, you can have your worker re-enqueue the job doing something like this: https://gist.github.com/mrrooijen/3719427

ruby thread block?

I read somewhere that ruby threads/fibre block the IO even with 1.9. Is this true and what does it truly mean? If I do some net/http stuff on multiple threads, is only 1 thread running at a given time for that request?
thanks
Assuming you are using CRuby, only one thread will be running at a time. However, the requests will be made in parallel, because each thread will be blocked on its IO while its IO is not finished. So if you do something like this:
require 'open-uri'
threads = 10.times.map do
  Thread.new do
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
    URI.open('http://example.com').read.length
  end
end
threads.each(&:join)
puts threads.map(&:value)
it will be much faster than doing it sequentially.
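The speed-up can be seen without touching the network. In this sketch, sleep stands in for a blocking request (an assumption made so that the example runs anywhere); fake_request and timed are helper names invented for the illustration.

```ruby
# sleep stands in for a blocking network call; like real IO in CRuby, it
# releases the interpreter lock while waiting.
def fake_request
  sleep 0.2
  :ok
end

# Returns the wall-clock seconds taken by the given block.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

sequential = timed { 5.times { fake_request } }
threaded   = timed { Array.new(5) { Thread.new { fake_request } }.each(&:join) }

puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```

The sequential run takes roughly five times the single-call delay, while the threaded run takes about as long as one call, because the waits overlap. CPU-bound work would not see the same gain on CRuby.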
Also, you can check whether a thread has finished without blocking on its completion.
For example:
require 'open-uri'
thread = Thread.new do
  sleep 10
  URI.open('http://example.com').read.length
end
# join(5) waits up to 5 seconds and returns nil if the thread is still running
puts 'still running' until thread.join(5)
puts thread.value
With CRuby, the threads cannot run at the same time, but they are still useful. Some of the other implementations, like JRuby, have real threads and can run multiple threads in parallel.
Some good references:
http://yehudakatz.com/2010/08/14/threads-in-ruby-enough-already/
http://www.engineyard.com/blog/2011/ruby-concurrency-and-you/
All threads run simultaneously but IO will be blocked until they all finish.
In other words, threading doesn't give you the ability to "background" a process. The interpreter will wait for all of the threads to complete before sending further messages.
This is good if you think about it because you don't have to wonder about whether they are complete if your next process uses data that the thread is modifying/working with.
If you want to background processes, check out delayed_job.
