How to launch multiple worker processes in eventmachine? - ruby

I'm using Rails 3, EventMachine and RabbitMQ.
When I publish message(s) to a queue, I need to launch multiple worker processes.
I understand that EventMachine is a solution for my scenario.
Some tasks will take longer than others.
From most EventMachine code samples, it looks like only a single thread/process runs at any given time.
How can I launch 2-4 workers at a single time?

If you use the EM.defer method, every proc you pass to it will be put in the thread pool (which defaults to 20 threads). You can have as many workers as you want if you change EM.threadpool_size.
worker = Proc.new do
  # log running job; this runs on one of EM's pool threads
end
EM.defer(worker)
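For example, a minimal sketch that caps the pool at four threads and hands each result back to the reactor through a callback (heavy_task here is a placeholder for your own blocking work, not an EventMachine API):
require 'eventmachine'

EM.threadpool_size = 4 # at most 4 deferred jobs run concurrently

EM.run do
  4.times do |i|
    work = proc { heavy_task(i) }          # placeholder; runs on a pool thread
    done = proc { |result| puts result }   # runs back on the reactor thread
    EM.defer(work, done)
  end
end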

Related

Ruby threading/forking with API (Sinatra)

I am using the Sinatra gem for my API. What I want to do is: when a request is received, process it, return the response, and then start a new long-running task.
I am a newbie to Ruby; I have read about threading but am not sure what the best way to accomplish my task is.
Here is my Sinatra endpoint:
post '/items' do
  # Processing data
  # Return response (body ...)
  # Start long running task
end
I would be grateful for any advice or example.
I believe a better way to do it is to use background jobs. While your worker executes a long-running task, it is unavailable for new requests. With background jobs, they do the work while your web worker handles new requests.
You can have a look at the most popular background job gems for Ruby as a starting point: resque, delayed_job, sidekiq.
UPD: The implementation depends on the chosen gem, but the general scheme will be like this:
# Controller
post '/items' do
  # Processing data
  MyAwesomeJob.enqueue # here you put your job into the queue
  head :ok # or whatever
end
In MyAwesomeJob you implement your long-running task.
Next, about Mongoid and background jobs: you should never use complex objects as job arguments. I don't know what kind of task you are implementing, but there is a general answer: use simple values.
For example, instead of using your User as an argument, pass user_id and find the user inside the job. If you do it like that, you can use any DB without problems.
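A minimal Sidekiq sketch of that pattern (MyAwesomeJob is the job from the scheme above; the User model and the endpoint wiring are assumed):
class MyAwesomeJob
  include Sidekiq::Worker

  def perform(user_id)
    user = User.find(user_id) # re-fetch the record inside the job
    # ... long-running work with user ...
  end
end

# In the endpoint:
MyAwesomeJob.perform_async(user.id)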
Agree with unkmas.
There are two ways to do this: threads, or a background job gem like Sidekiq.
Threads are perfectly fine if the processing times aren't that high and you don't want to write code for a worker. But there is a strong chance of spawning too many threads if you don't use a thread pool, or if you're expecting bursty HTTP traffic.
The best way to do it is by using Sidekiq or something similar. You could even have a job queue like beanstalkd in between, enqueue the job to it, and return the response; a worker then reads from the queue and processes the job later on.
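For completeness, the bare-thread variant looks like this; note that nothing bounds the thread count under bursty traffic, which is exactly the risk described above (process_long_task is a placeholder):
post '/items' do
  payload = request.body.read
  Thread.new { process_long_task(payload) } # fire and forget, unbounded
  status 202
  body 'accepted'
end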

Rufus Scheduler blocks after how many threads?

I'm writing a scheduler with rufus where the scheduled tasks will overlap. This is expected behavior, but I was curious how rufus handles the overlap. Will it overlap up to n threads and then block from there? Or does it continue to overlap without regard for how many concurrent tasks run at a time?
Ideally I would like to take advantage of rufus's concurrency and not have to manage my own pool of threads. I would like it to block once I've reached the max pool count.
scheduler = Rufus::Scheduler.new

# Syncs one tenant on every call. Overlapping calls allow multiple
# syncs to occur until all threads are expended, then block until a
# thread is available.
scheduler.every '30s', SingleTenantSyncHandler
Edit
I see from the README that rufus does use thread pools in version 3.x.
You can set the max thread count like:
scheduler = Rufus::Scheduler.new(:max_work_threads => 77)
Assuming this answers my question but still would like confirmation from others.
Yes, I confirm, https://github.com/jmettraux/rufus-scheduler/#max_work_threads answers your question. Note that this thread pool is shared among all scheduled jobs in the rufus-scheduler instance.
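A small sketch of that setting in context (SingleTenantSyncHandler is the handler class from the question):
require 'rufus-scheduler'

# All jobs in this scheduler instance share a pool of at most 4 work
# threads; overlapping triggerings beyond that wait for a free thread.
scheduler = Rufus::Scheduler.new(:max_work_threads => 4)

scheduler.every '30s', SingleTenantSyncHandler

scheduler.join # keep the main thread alive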

concurrent access to api with limits

This question is not only about Ruby.
I have many workers running that create many connections to an external API. This API has a rate limit.
Right now I use Sidekiq and Redis for limiting access.
I.e., for every rate-limited API access I run a worker.
When a worker starts, it checks when the API was last accessed; if that was more recently than the API allows, the worker is rescheduled, otherwise it records the time in Redis and runs the request.
ex:
def run_or_schedule
  limiter = RedisLimiter.new(account_token)
  if limiter.can_i_run?
    limiter.like
    run
  else
    ApiWorker.perform_at(limiter.next_like, *params)
  end
end
The problem is that I create many requests and they get rescheduled many times.
Can someone recommend a better solution?
Do any design patterns exist for this?
One alternative to the polling approach you are using would be to have a supervisor.
So instead of having each worker handle the question itself, you have another object/worker/process which decides when it is time for the next worker to run.
If the API imposes a time limit between requests, you could have this supervisor execute as often as the time limit allows.
If the limit is more complex (e.g. a total number of requests per X interval), you could have the supervisor run constantly and, upon hitting the limit, block (sleep) for the rest of that interval. Upon resuming, it could continue with the next worker from the queue; see the sketch below.
One apparent advantage of this approach is that you skip the overhead of each individual worker being instantiated, checking whether it should run, and being rescheduled when it shouldn't.
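For the "N requests per X seconds" case, such a supervisor loop might look like this rough sketch (pending_jobs and dispatch are placeholders for your queue and your worker hand-off):
MAX_PER_WINDOW = 60 # e.g. 60 requests...
WINDOW = 60         # ...per 60 seconds

window_start = Time.now
sent = 0

loop do
  job = pending_jobs.pop # placeholder: blocks until a job is queued
  if sent >= MAX_PER_WINDOW
    sleep [WINDOW - (Time.now - window_start), 0].max # wait out the window
    window_start = Time.now
    sent = 0
  end
  dispatch(job) # placeholder: hand the job to a worker / call the API
  sent += 1
end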
You can use a queue and a dedicated thread that sends events (when there are any waiting) at the maximum allowable rate.
Say you can send one API call every second; then you can do the following:
class APIProxy
  def initialize
    @like_queue = Queue.new # thread-safe FIFO
  end

  def like(data)
    @like_queue << data
  end

  def run
    Thread.new do
      loop do
        actual_send_like @like_queue.pop # blocks until data is available
        sleep 1                          # at most one call per second
      end
    end
  end

  private

  def actual_send_like(data)
    # use the API you need
  end
end
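Usage would look something like this (the payload is hypothetical):
proxy = APIProxy.new
proxy.run                # start the rate-limited sender thread
proxy.like(song_id: 42)  # returns immediately; sent at most one per second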

Why does resque use child processes for processing each job in a queue?

We have been using Resque in most of our projects, and we have been happy with it.
In a recent project we had a situation where we make a connection to Twitter's live streaming API. Since we have to maintain the connection, we dump each line from the streaming API into a Resque queue so that the connection is not lost, and we process the queue afterwards.
We had a situation where the insertion rate into the queue was on the order of 30-40/second while the rate at which the queue was popped was only 3-5/second. Because of this, the queue kept growing. When we looked for the reason, we found that Resque has a parent process, and for each job in the queue it forks a child process which then processes the job. Our Rails environment was quite heavy and the child process forking was taking time.
So, we implemented another rake task of this sort, for the time being:
task :process_queue => :environment do
  while true
    begin
      interaction = Resque.pop("process_twitter_resque")
      if interaction
        ProcessTwitterResque.perform(interaction)
      end
    rescue => e
      puts e.message
      puts e.backtrace.join("\n")
    end
  end
end
and started the task like this:
nohup bundle exec rake process_queue --trace >> log/workers/process_queue/worker.log 2>&1 &
This does not handle failed jobs and the like.
But my question is: why does Resque fork a child process to run each job from the queue? The jobs definitely do not need to be processed in parallel (since it is a queue, we expect them to be processed one after the other, sequentially, and I believe Resque also forks only one child process at a time).
I am sure Resque has done this with some purpose in mind. What is the exact purpose behind this parent/child process architecture?
The Ruby process that sits and listens for jobs in Redis is not the process that ultimately runs the job code written in the perform method. It is the “master” process, and its only responsibility is to listen for jobs. When it receives a job, it forks yet another process to run the code. This other “child” process is managed entirely by its master. The user is not responsible for starting or interacting with it using rake tasks. When the child process finishes running the job code, it exits and returns control to its master. The master now continues listening to Redis for its next job.
The advantage of this master-child process organization – and the advantage of Resque processes over threads – is the isolation of job code. Resque assumes that your code is flawed, and that it contains memory leaks or other errors that will cause abnormal behavior. Any memory claimed by the child process will be released when it exits. This eliminates the possibility of unmanaged memory growth over time. It also provides the master process with the ability to recover from any error in the child, no matter how severe. For example, if the child process needs to be terminated using kill -9, it will not affect the master’s ability to continue processing jobs from the Redis queue.
In earlier versions of Ruby, Resque’s main criticism was its potential to consume a lot of memory. Creating new processes means creating a separate memory space for each one. Some of this overhead was mitigated with the release of Ruby 2.0 thanks to copy-on-write. However, Resque will always require more memory than a solution that uses threads because the master process is not forked. It’s created manually using a rake task, and therefore must load whatever it needs into memory from the start. Of course, manually managing each worker process in a production application with a potentially large number of jobs quickly becomes untenable. Thankfully, we have pool managers for that.
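The fork-per-job pattern itself can be sketched in a few lines (a simplification, not Resque's actual source; reserve_next_job stands in for popping from Redis):
loop do
  job = reserve_next_job # placeholder: pop the next job from Redis
  unless job
    sleep 5 # idle poll when the queue is empty
    next
  end

  if (pid = fork)
    Process.wait(pid) # master blocks until the child exits
  else
    job.perform # child runs the job code in isolation
    exit!       # child exits; all of its memory is released
  end
end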
Resque uses #fork for two reasons (among others): the ability to prevent zombie workers (just kill them) and the ability to use multiple cores (since each job runs in another process).
Maybe this will help you with your fast-executing jobs: http://thewebfellas.com/blog/2012/12/28/resque-worker-performance

How can I make resque worker process other jobs while current job is sleeping?

Each task I have works in short bursts, then sleeps for about an hour, then works again, and so on until the job is done. Some jobs may take about 10 hours to complete and there is nothing I can do about it.
What bothers me is that while a job is sleeping, its Resque worker stays busy, so if I have 4 workers and 5 jobs, the last job has to wait 10 hours until it can be processed. That is grossly suboptimal, since it could run while any other job is sleeping. Is there any way to make a Resque worker process other jobs while the current job is sleeping?
Currently I have a worker similar to this:
class ImportSongs
  def self.perform(api_token, songs)
    api = API.new api_token

    songs.each_with_index do |song, i|
      # make the current worker proceed with another job while this one sleeps
      sleep 60 * 60 if i != 0 && i % 100 == 0
      api.import_song song
    end
  end
end
It looks like the problem you're trying to solve is API rate limiting with batch processing of the import.
You should have one job that runs as soon as it's enqueued and enumerates all the songs to be imported. You can then break those down into groups of 100 (or whatever size you have to limit them to) and schedule a deferred job using resque-scheduler at one-hour intervals.
However, if you have a hard API rate limit and you execute several of these distributed imports concurrently, you may not be able to control how much API traffic is going at once. If your rate limit is that strict, you may want to build a specialized process as a single point of control that enforces the rate limiting with its own work queue.
With resque-scheduler, you'll be able to repeat discrete jobs at scheduled or delayed times as an alternative to a single long-running job that loops with sleep statements.
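A sketch of that restructuring using resque-scheduler's Resque.enqueue_in (ImportSongBatch is a hypothetical batch job; ids are passed instead of whole song objects):
class ImportSongs
  @queue = :import

  def self.perform(api_token, song_ids)
    song_ids.each_slice(100).with_index do |batch, i|
      # schedule each batch of 100 an hour apart instead of sleeping in-process
      Resque.enqueue_in(i * 3600, ImportSongBatch, api_token, batch)
    end
  end
end

class ImportSongBatch
  @queue = :import

  def self.perform(api_token, song_ids)
    api = API.new api_token
    song_ids.each { |id| api.import_song(id) }
  end
end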