Ruby and Rails Async - ruby

I need to perform long-running operation in ruby/rails asynchronously.
Googling around one of the options I find is Sidekiq.
class WeeklyReportWorker
include Sidekiq::Worker
def perform(user, product, year = Time.now.year, week = Date.today.cweek)
report = WeeklyReport.build(user, product, year, week)
report.save
end
end
# call WeeklyReportWorker.perform_async('user', 'product')
Everything works great! But there is a problem.
If I keep calling this async method every few seconds, but the actual time heavy operation performs is one minute things won't work.
Let me put it in example.
5.times { WeeklyReportWorker.perform_async('user', 'product') }
Now my heavy operation will be performed 5 times. Optimally it should have performed only once or twice depending on whether execution of first operaton started before 5th async call was made.
Do you have tips how to solve it?

Here's a naive approach. I'm a resque user, maybe sidekiq has something better to offer.
def perform(user, product, year = Time.now.year, week = Date.today.cweek)
# first, make a name for lock key. For example, include all arguments
# there, so that another perform with the same arguments won't do any work
# while the first one is still running
lock_key_name = make_lock_key_name(user, product, year, week)
Sidekiq.redis do |redis| # sidekiq uses redis, let us leverage that
begin
res = redis.incr lock_key_name
return if res != 1 # protection from race condition. Since incr is atomic,
# the very first one will set value to 1. All subsequent
# incrs will return greater values.
# if incr returned not 1, then another copy of this
# operation is already running, so we quit.
# finally, perform your business logic here
report = WeeklyReport.build(user, product, year, week)
report.save
ensure
redis.del lock_key_name # drop lock key, so that operation may run again.
end
end
end

I am not sure I understood your scenario well, but how about looking at this gem:
https://github.com/collectiveidea/delayed_job
So instead of doing:
5.times { WeeklyReportWorker.perform_async('user', 'product') }
You can do:
5.times { WeeklyReportWorker.delay.perform('user', 'product') }
Out of the box, this will make the worker process the second job after the first job, but only if you use the default settings (because by default the worker process is only one).
The gem offers possibilities to:
Put jobs on a queue;
Have different queues for different jobs if that is required;
Have more than one workers to process a queue (for example, you can start 4 workers on a 4-CPU machine for higher efficiency);
Schedule jobs to run at exact times, or after set amount of time after queueing the job. (Or, by default, schedule for immediate background execution).
I hope it can help you as you did to me.

Related

how can I run ruby script for n-occurrences every X-minutes

I am looking at https://github.com/jmettraux/rufus-scheduler which nicely allows me to schedule and chain events.
But I want the events to stop after, say, the 10th occurrence or after the 24 days.
How do I do this?
My case would be:
run a script which creates the recurring jobs based on intervals and then stops after a given date or occurrence.
This is what I have done.
def run_schedule(url, count, method, interval)
puts "running scheduler"
scheduler = Rufus::Scheduler.new
scheduler.every interval do
binding.pry
attack_loop(url, count, method)
end
end
I am testing my site and want the attack_loop to be scheduled in memory to run against the interval.
But it appears it never hits the binding.pry line.
Normally these schedulers are running via cron jobs. Then the problem with your requirement is cron job doesn't know whether you hit the 10th occurrence or the 24 days as it doesnt keep a track. One possible solution would be to create a separate table to update the cron job details.
I'm thinking a table like,
scheduler_details
- id
- occurrence_count
- created_date
- updated_date
_ scheduler_type
So, now when you run a script, you can create or update the details. (You can search the script by scheduler_type, that way
you can check the number of occurrences
with created date, you can calculate the 24 days
HTH
Good day, you can specify the number of times a job should run:
https://github.com/jmettraux/rufus-scheduler/#times--nb-of-times-before-auto-unscheduling
If you need a more elaborate stop condition, you could do:
scheduler.every interval do |job|
if Time.now > some_max_date
job.unschedule
else
attack_loop(url, count, method)
end
end
But it appears it never hits the binding.pry line.
What has it to do with your issue title? Are you mixing issues?

Big task or multiple small tasks with Sidekiq

I'm writting a worker to add lot's of users into a group. I'm wondering if it's better to run a big task who had all users, or batch like 100 users or one by one per task.
For the moment here is my code
class AddUsersToGroupWorker
include Sidekiq::Worker
sidekiq_options :queue => :group_utility
def perform(store_id, group_id, user_ids_to_add)
begin
store = Store.find store_id
group = Group.find group_id
rescue ActiveRecord::RecordNotFound => e
Airbrake.notify e
return
end
users_to_process = store.users.where(id: user_ids_to_add)
.where.not(id: group.user_ids)
group.users += users_to_process
users_to_process.map(&:id).each do |user_to_process_id|
UpdateLastUpdatesForUserWorker.perform_async store.id, user_to_process_id
end
end
end
Maybe it's better to have something like this in my method :
def add_users
users_to_process = store.users.where(id: user_ids_to_add)
.where.not(id: group.user_ids)
users_to_process.map(&:id).each do |user_to_process_id|
AddUserToGroupWorker.perform_async group_id, user_to_process_id
UpdateLastUpdatesForUserWorker.perform_async store.id, user_to_process_id
end
end
But so many find request. What do you think ?
I have a sidekig pro licence if needed (for batch for example).
Here are my thoughts.
1. Do a single SQL query instead of N queries
This line: group.users += users_to_process is likely to produce N SQL queries (where N is users_to_process.count). I assume that you have many-to-many connection between users and groups (with user_groups join table/model), so you should use some Mass inserting data technique:
users_to_process_ids = store.users.where(id: user_ids_to_add)
.where.not(id: group.user_ids)
.pluck(:id)
sql_values = users_to_process_ids.map{|i| "(#{i.to_i}, #{group.id.to_i}, NOW(), NOW())"}
Group.connection.execute("
INSERT INTO groups_users (user_id, group_id, created_at, updated_at)
VALUES #{sql_values.join(",")}
")
Yes, it's raw SQL. And it's fast.
2. User pluck(:id) instead of map(&:id)
pluck is much quicker, because:
It will select only 'id' column, so less data is transferred from DB
More importantly, it won't create ActiveRecord object for each raw
Doing SQL is cheap. Creating Ruby objects is really expensive.
3. Use horizontal parallelization instead of vertical parallelization
What I mean here, is if you need to do sequential tasks A -> B -> C for a dozen of records, there are two major ways to split the work:
Vertical segmentation. AWorker does A(1), A(2), A(3); BWorker does B(1), etc.; CWorker does all C(i) jobs;
Horizontal segmentation. UniversalWorker does A(1)+B(1)+C(1).
Use the latter (horizontal) way.
It's a statement from experience, not from some theoretical point of view (where both ways are feasible).
Why you should do that?
When you use vertical segmentation, you will likely get errors when you pass job from one worker down to another. Like such kind of errors. You will pull your hair out if you bump into such errors, because they aren't persistent and easily reproducible. Sometimes they happen and sometimes they aren't. Is it possible to write a code which will pass the work down the chain without errors? Sure, it is. But it's better to keep it simple.
Imagine that your server is at rest. And then suddenly new jobs arrive. Your B and C workers will just waste the RAM, while your A workers do the job. And then your A and C will waste the RAM, while B's are at work. And so on. If you make horizontal segmentation, your resource drain will even itself out.
Applying that advice to your specific case: for starters, don't call perform_async in another async task.
4. Process in batches
Answering your original question – yes, do process in batches. Creating and managing async task takes some resources by itself, so there's no need to create too many of them.
TL;DR So in the end, your code could look something like this:
# model code
BATCH_SIZE = 100
def add_users
users_to_process_ids = store.users.where(id: user_ids_to_add)
.where.not(id: group.user_ids)
.pluck(:id)
# With 100,000 users performance of this query should be acceptable
# to make it in a synchronous fasion
sql_values = users_to_process_ids.map{|i| "(#{i.to_i}, #{group.id.to_i}, NOW(), NOW())"}
Group.connection.execute("
INSERT INTO groups_users (user_id, group_id, created_at, updated_at)
VALUES #{sql_values.join(",")}
")
users_to_process_ids.each_slice(BATCH_SIZE) do |batch|
AddUserToGroupWorker.perform_async group_id, batch
end
end
# add_user_to_group_worker.rb
def perform(group_id, user_ids_to_add)
group = Group.find group_id
# Do some heavy load with a batch as a whole
# ...
# ...
# If nothing here is left, call UpdateLastUpdatesForUserWorker from the model instead
user_ids_to_add.each do |id|
# do it synchronously – we already parallelized the job
# by splitting it in slices in the model above
UpdateLastUpdatesForUserWorker.new.perform store.id, user_to_process_id
end
end
There's no silver bullet. It depends on your goals and your application. General questions to ask yourself:
How much user ids could you pass to a worker? Is it possible to pass 100? What about 1000000?
How long your workers can work? Should it have any restrictions about working time? Can they stuck?
For a big applications it's necessary to split passed arguments to smaller chunks, to avoid creating long-running jobs. Creating a lot of small jobs allows you to scale easily - you can always add more workers.
Also it might be a good idea to define kind of timeout for workers, to stop processing of stuck workers.

Dynamically loading new jobs in a SidekiqStatus container to monitor completion

I built a small web crawler implemented in two Sidekiq workers: Crawler and Parsing. The Crawler worker will seek for links while Parsing worker will read the page body.
I want to trigger an alert when the crawling/parsing of all pages is complete. Monitoring only the Crawler job is not the best solution since it may have finished but there might be several Parser jobs running.
Having a look at sidekiq-status gem it seems that I cannot dynamically add new jobs to the container for monitoring. E.g. it would be nice to have a "add" method in the following context:
#container = SidekiqStatus::Container.new
# ... for each page url found:
jid = ParserWorker.perform_async(page_url)
#container.add(jid)
The closest to this is to use "SidekiqStatus::Container.load" or "SidekiqStatus::Container.load_multi" however, it is not possible to add new jobs in the container a posteriori.
One solution would be to create as many SidekiqStatus::Container instances as the number of ParserJobs and check if all of them have status == "finished", but I wonder if a more elegant solution exists using these tools.
Any help is appreciated.
You are describing Sidekiq Pro's Batches feature exactly. You can spend a lot of time or some money to solve your problem.
https://github.com/mperham/sidekiq/wiki/Batches
OK, here's a simple solution. Using the sidekiq-status gem, the Crawler worker keeps track of the jobs IDs for the Parser jobs and halts if any Parser job is still busy (using the SidekiqStatus::Container instance to check job status).
def perform()
# for each page....
#jids << ParserWorker.perform_async(page_url)
# end
# crawler finished, parsers may still be running
while parsers_busy?
sleep 5 # wait 5 secs between each check
end
# all parsers complete, trigger notification...
end
def parsers_busy?
status_containers = SidekiqStatus::Container.load_multi(#jids)
for container in status_containers
if container.status == 'waiting' || container.status == 'working'
return true
end
end
return false
end

Howto know that I do not block Ruby eventmachine with a mongodb operation

I am working on a eventmachine based application that periodically polls for changes of MongoDB stored documents.
A simplified code snippet could look like:
require 'rubygems'
require 'eventmachine'
require 'em-mongo'
require 'bson'
EM.run {
#db = EM::Mongo::Connection.new('localhost').db('foo_development')
#posts = #db.collection('posts')
#comments = #db.collection('comments')
def handle_changed_posts
EM.next_tick do
cursor = #posts.find(state: 'changed')
resp = cursor.defer_as_a
resp.callback do |documents|
handle_comments documents.map{|h| h["comment_id"]}.map(&:to_s) unless documents.length == 0
end
resp.errback do |err|
raise *err
end
end
end
def handle_comments comment_ids
meta_product_ids.each do |id|
cursor = #comments.find({_id: BSON::ObjectId(id)})
resp = cursor.defer_as_a
resp.callback do |documents|
magic_value = documents.first['weight'].to_i * documents.first['importance'].to_i
end
resp.errback do |err|
raise *err
end
end
end
EM.add_periodic_timer(1) do
puts "alive: #{Time.now.to_i}"
end
EM.add_periodic_timer(5) do
handle_changed_posts
end
}
So every 5 seconds EM iterates over all posts, and selects the changed ones. For each changed post it stores the comment_id in an array. When done that array is passed to a handle_comments which loads every comment and does some calculation.
Now I have some difficulties in understanding:
I know, that this load_posts->load_comments->calculate cycle takes 3 seconds in a Rails console with 20000 posts, so it will not be much faster in EM. I schedule the handle_changed_posts method every 5 seconds which is fine unless the number of posts raises and the calculation takes longer than the 5 seconds after which the same run is scheduled again. In that case I'd have a problem soon. How to avoid that?
I trust em-mongo but I do not trust my EM knowledge. To monitor EM is still running I puts a timestamp every second. This seems to be working fine but gets a bit bumpy every 5 seconds when my calculation runs. Is that a sign, that I block the loop?
Is there any general way to find out if I block the loop?
Should I nice my eventmachine process with -19 to give it top OS prio always?
I have been reluctant to answer here since I've got no mongo experience so far, but considering no one is answering and some of the stuff here is general EM stuff I may be able to help:
schedule next scan on first scan's end (resp.callback and resp.errback in handle_changed_posts seem like good candidates to chain next scan), either with add_timer or with next_tick
probably, try handling your mongo trips more often so they handle smaller chunks of data, any cpu cycle hog inside your reactor would make your reactor loop too busy to accept events such as periodic timer ticks
no simple way, no. One idea would be to measure diff of Time.now to next_tick{Time.now}, do benchmark and then trace possible culprits when the diff crosses a threshold. Simulating slow queries (Simulate slow query in mongodb? ?) and many parallel connections is a good idea
I honestly don't know, I've never encountered people who do that, I expect it depends on other things running on that server
To expand upon bbozo's answer, specifically in relation to your second question, there is no time when you run code that you do not block the loop. In my experience, when we talk about 'non-blocking' code what we really mean is 'code that doesn't block very long'. Typically, these are very short periods of time (less than a millisecond), but they still block while executing.
Further, the only thing next_tick really does is to say 'do this, but not right now'. What you really want to do, as bbozo mentioned, is split up your processing over multiple ticks such that each iteration blocks for as little time as possible.
To use your own benchmarks, if 20,000 records takes about 3 seconds to process, 4,000 records should take about 0.6 seconds. This would be short enough to not usually affect your 1 second heartbeat. You could split it up even farther to reduce the amount of blockage and make the reactor run smoother, but it really depends on how much concurrency you need from the reactor.

Rails 3.2.8 - Should I use Thread when I save multiple data into database?

I'm using Thread quite often and I wonder if this is a good practice:
def self.create_all_posts
threads = []
self.fetch_all_posts.each do |e|
if e.present?
threads << Thread.new {
self.create(title: e[:title], url: e[:url])
}
end
end
main = Thread.main # The main thread
current = Thread.current # The current thread
all = Thread.list # All threads still running
all.each { |t| t.join }
end
Basically, yes. You might need to call config.threadsafe! in the application.rb and mayve allow_concurrency: true in the database.yml. Depending on your rails version you might needat least first one, otherwise your db request maight not run in parallel.
Still, in your case there might be no big performance effect on running everal "INSERT INTO..." in parallel, thought it heavily depend on your disks, memory and CPU situation on db host. BTW, if your fetch_all_posts takes considerable time to fetch, you can use find_each approach, that possible would start creation threads in parallel of scnning huge data set. You can set the 'page' size for find_each to make it run theads, say, on every 10 posts.

Resources