How to know that I do not block Ruby EventMachine with a MongoDB operation

I am working on a eventmachine based application that periodically polls for changes of MongoDB stored documents.
A simplified code snippet could look like:
require 'rubygems'
require 'eventmachine'
require 'em-mongo'
require 'bson'

EM.run {
  @db       = EM::Mongo::Connection.new('localhost').db('foo_development')
  @posts    = @db.collection('posts')
  @comments = @db.collection('comments')

  def handle_changed_posts
    EM.next_tick do
      cursor = @posts.find(state: 'changed')
      resp = cursor.defer_as_a
      resp.callback do |documents|
        handle_comments documents.map { |h| h["comment_id"] }.map(&:to_s) unless documents.length == 0
      end
      resp.errback do |err|
        raise *err
      end
    end
  end

  def handle_comments comment_ids
    comment_ids.each do |id|
      cursor = @comments.find({ _id: BSON::ObjectId(id) })
      resp = cursor.defer_as_a
      resp.callback do |documents|
        magic_value = documents.first['weight'].to_i * documents.first['importance'].to_i
      end
      resp.errback do |err|
        raise *err
      end
    end
  end

  EM.add_periodic_timer(1) do
    puts "alive: #{Time.now.to_i}"
  end

  EM.add_periodic_timer(5) do
    handle_changed_posts
  end
}
So every 5 seconds EM iterates over all posts and selects the changed ones. For each changed post it stores the comment_id in an array. When done, that array is passed to handle_comments, which loads every comment and does some calculation.
Now I have some difficulties in understanding:
I know that this load_posts->load_comments->calculate cycle takes 3 seconds in a Rails console with 20000 posts, so it will not be much faster in EM. I schedule the handle_changed_posts method every 5 seconds, which is fine unless the number of posts rises and the calculation takes longer than the 5 seconds after which the same run is scheduled again. In that case I'd have a problem soon. How to avoid that?
I trust em-mongo but I do not trust my EM knowledge. To monitor that EM is still running, I puts a timestamp every second. This seems to be working fine but gets a bit bumpy every 5 seconds when my calculation runs. Is that a sign that I block the loop?
Is there any general way to find out if I block the loop?
Should I nice my eventmachine process with -19 to give it top OS prio always?

I have been reluctant to answer here since I've got no mongo experience so far, but considering no one is answering and some of the stuff here is general EM stuff I may be able to help:
schedule the next scan at the end of the previous one (resp.callback and resp.errback in handle_changed_posts are good candidates to chain the next scan), either with add_timer or with next_tick; a sketch follows this list
try making your mongo trips more frequent so that each handles a smaller chunk of data; any CPU-cycle hog inside your reactor makes the reactor loop too busy to accept events such as periodic timer ticks
no simple way, no. One idea would be to measure the diff between Time.now and next_tick { Time.now }, benchmark it, and then trace possible culprits whenever the diff crosses a threshold. Simulating slow queries (see: Simulate slow query in mongodb?) and many parallel connections is a good idea
I honestly don't know; I've never encountered people who do that. I expect it depends on what else is running on that server
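To illustrate the first point, here is a minimal sketch of re-arming the scan from the previous scan's callbacks, so runs can never overlap. It reuses @posts and handle_comments from the question; the 5-second pause between scans is an assumption you would tune:

def handle_changed_posts
  cursor = @posts.find(state: 'changed')
  resp = cursor.defer_as_a
  resp.callback do |documents|
    handle_comments documents.map { |h| h["comment_id"] }.map(&:to_s) unless documents.empty?
    EM.add_timer(5) { handle_changed_posts } # re-arm only after this scan finished
  end
  resp.errback do |err|
    EM.add_timer(5) { handle_changed_posts } # keep polling even after an error
  end
end

EM.next_tick { handle_changed_posts } # kick off the first scan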

To expand upon bbozo's answer, specifically in relation to your second question: there is no time when you run code that you do not block the loop. In my experience, when we talk about 'non-blocking' code, what we really mean is 'code that doesn't block for very long'. Typically these are very short periods of time (less than a millisecond), but they still block while executing.
Further, the only thing next_tick really does is to say 'do this, but not right now'. What you really want to do, as bbozo mentioned, is split up your processing over multiple ticks such that each iteration blocks for as little time as possible.
To use your own benchmarks, if 20,000 records take about 3 seconds to process, 4,000 records should take about 0.6 seconds. That would be short enough not to noticeably affect your 1-second heartbeat. You could split it up even further to reduce the blockage and make the reactor run more smoothly, but it really depends on how much concurrency you need from the reactor.
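A sketch of that chunked approach, assuming documents is the array delivered by defer_as_a and process is a hypothetical per-document helper; each reactor tick handles one slice, so timers can fire in between:

def process_in_chunks(documents, chunk_size = 4_000)
  documents.shift(chunk_size).each { |doc| process(doc) } # hypothetical per-document work
  # yield back to the reactor before taking the next slice
  EM.next_tick { process_in_chunks(documents, chunk_size) } unless documents.empty?
end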

Related

In Ruby, is it possible to share a database connection across threads?

I've got a small little ruby script that pores over 80,000 or so records.
The processor and memory load involved for each record is smaller than a smurf's, but it still takes about 8 minutes to walk all the records.
I'd thought to use threading, but when I gave it a go, my db ran out of connections. Sure, that was when I attempted to connect 200 times, and really I could limit it better than that. But when I'm pushing this code up to Heroku (where I have 20 connections for all workers to share), I don't want to chance blocking other processes because this one ramped up.
I have thought of refactoring the code so that it combines all the SQL, but that is going to feel really, really messy.
So I'm wondering: is there a trick to letting the threads share connections? Given that I don't expect the connection variable to change during processing, I am actually sort of surprised that the thread fork needs to create a new DB connection.
Well, any help would be super cool (just like me). Thanks!
SUPER CONTRIVED EXAMPLE
Below is a 100% contrived example. It does display the issue.
I am using ActiveRecord inside a very simple thread. It seems each thread is creating its own connection to the database. I base that assumption on the warning message that follows.
START_TIME = Time.now

require 'rubygems'
require 'erb'
require "active_record"

@environment = 'development'
@dbconfig = YAML.load(ERB.new(File.read('config/database.yml')).result)
ActiveRecord::Base.establish_connection @dbconfig[@environment]

class Product < ActiveRecord::Base; end

ids = Product.pluck(:id)
p "after pluck #{Time.now.to_f - START_TIME.to_f}"

threads = []
ids.each do |id|
  threads << Thread.new { Product.where(:id => id).update_all(:product_status_id => 99) }
  if threads.size > 4
    threads.each(&:join)
    threads = []
    p "after thread join #{Time.now.to_f - START_TIME.to_f}"
  end
end
p "#{Time.now.to_f - START_TIME.to_f}"
OUTPUT
"after pluck 0.6663269996643066"
DEPRECATION WARNING: Database connections will not be closed automatically, please close your
database connection at the end of the thread by calling `close` on your
connection. For example: ActiveRecord::Base.connection.close
. (called from mon_synchronize at /Users/davidrawk/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/monitor.rb:211)
.....
"after thread join 5.7263710498809814" #THIS HAPPENS AFTER THE FIRST JOIN.
.....
"after thread join 10.743254899978638" #THIS HAPPENS AFTER THE SECOND JOIN
See the connection_pool gem (https://github.com/mperham/connection_pool) and this answer: Why not use shared ActiveRecord connections for Rspec + Selenium? A connection pool might be what you need.
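A minimal sketch with connection_pool; the pool size of 5 and SomeDbClient (a stand-in for whatever driver you use) are assumptions you would adapt:

require 'connection_pool'

# 5 connections shared across however many threads you spawn;
# the block builds one connection (driver-specific, shown as a stand-in)
DB_POOL = ConnectionPool.new(size: 5, timeout: 5) { SomeDbClient.new }

threads = ids.map do |id|
  Thread.new do
    DB_POOL.with do |conn| # blocks until one of the 5 connections is free
      conn.update_status(id, 99) # hypothetical call for your driver
    end
  end
end
threads.each(&:join)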
The other option would be to use https://github.com/eventmachine/eventmachine and run your tasks in an EM.defer block, so that the blocking DB access happens on EventMachine's thread pool while the callback runs back within the reactor.
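A sketch of the EM.defer variant, adapting the question's update loop; the operation runs on EventMachine's internal thread pool (20 threads by default) and the callback runs on the reactor thread:

require 'eventmachine'

EM.run do
  ids.each do |id|
    operation = proc do
      # blocking DB call, executed on EM's thread pool, off the reactor thread
      Product.where(:id => id).update_all(:product_status_id => 99)
    end
    callback = proc do |result|
      # back on the reactor thread; schedule follow-up work here
    end
    EM.defer(operation, callback)
  end
end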
Alternatively, and a more robust solution too, go for a lightweight background processing queue such as beanstalkd; see https://www.ruby-toolbox.com/categories/Background_Jobs for more options. This would be my primary recommendation.
EDIT:
Also, you probably don't have 200 cores, so creating 200+ parallel threads and DB connections doesn't really speed up the process (it actually slows it down). See if you can find a way to partition your problem into a number of sets equal to your number of cores + 1, and solve the problem that way.
That is probably the simplest solution to your problem.

Is there a way to call a block every microsecond using celluloid?

I'm using celluloid's every method to execute a block every microsecond; however, it seems to always call the block every second, even when I specify a decimal.
interval = 1.0 / 2.0
every interval do
  puts "*" * 80
  puts "Time: #{Time.now}"
  puts "*" * 80
end
I would expect this to be called every 0.5 seconds. But it is called every one second.
Any suggestions?
You can get fractional second resolution with Celluloid.
Celluloid uses the Timers gem to manage every; it does proper floating-point time math and relies on Ruby's sleep, which has reasonable sub-second resolution.
The following code works perfectly:
class Bob
  include Celluloid

  def fred
    every 0.5 do
      puts Time.now.strftime "%M:%S.%N"
    end
  end
end

Bob.new.fred
And it produces the following output:
22:51.299923000
22:51.801311000
22:52.302229000
22:52.803512000
22:53.304800000
22:53.805759000
22:54.307003000
22:54.808279000
22:55.309358000
22:55.810017000
As you can see, it is not perfect, but close enough for most purposes.
If you are seeing different results, it is likely because of how long your code takes in the block you have given to every or other timers running and starving that particular one. I would approach it by simplifying the situation as much as possible and slowly adding parts back in to determine where the slowdown is occurring.
As for microsecond resolution, I don't think you can hope to get that far down reliably with any non-trivial code.
The trivial example:
def bob
  puts Time.now.strftime "%M:%S.%N"
  sleep 1.0e-6
  puts Time.now.strftime "%M:%S.%N"
end
Produces:
31:07.373858000
31:07.373936000
31:08.430110000
31:08.430183000
31:09.062000000
31:09.062079000
31:09.638078000
31:09.638156000
So as you can see, even just a base ruby version on my machine running nothing but a simple IO line doesn't reliably give me microsecond speeds.

Ruby and Rails Async

I need to perform a long-running operation in ruby/rails asynchronously.
Googling around, one of the options I found is Sidekiq.
class WeeklyReportWorker
  include Sidekiq::Worker

  def perform(user, product, year = Time.now.year, week = Date.today.cweek)
    report = WeeklyReport.build(user, product, year, week)
    report.save
  end
end

# call WeeklyReportWorker.perform_async('user', 'product')
Everything works great! But there is a problem.
If I keep calling this async method every few seconds while the heavy operation it performs actually takes a minute, things won't work.
Let me put it in example.
5.times { WeeklyReportWorker.perform_async('user', 'product') }
Now my heavy operation will be performed 5 times. Optimally it should have been performed only once or twice, depending on whether execution of the first operation started before the 5th async call was made.
Do you have tips how to solve it?
Here's a naive approach. I'm a resque user, maybe sidekiq has something better to offer.
def perform(user, product, year = Time.now.year, week = Date.today.cweek)
  # First, make a name for the lock key. Include all the arguments, so that
  # another perform with the same arguments won't do any work while the
  # first one is still running.
  lock_key_name = make_lock_key_name(user, product, year, week)

  Sidekiq.redis do |redis| # sidekiq uses redis, let us leverage that
    res = redis.incr lock_key_name
    # Protection from a race condition: since incr is atomic, only the very
    # first caller sees 1; all subsequent incrs return greater values. If we
    # did not get 1, another copy of this operation is already running, so
    # we quit without touching the lock key.
    return if res != 1

    begin
      # finally, perform your business logic here
      report = WeeklyReport.build(user, product, year, week)
      report.save
    ensure
      redis.del lock_key_name # drop the lock key, so the operation may run again
    end
  end
end
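One caveat with incr-based locks: if the process is killed before the ensure runs, the key stays set forever and the job never runs again. A sketch of a variant using SET with the NX/EX options, so the lock expires on its own; the 600-second TTL is an assumed upper bound on the job's runtime:

Sidekiq.redis do |redis|
  # NX: only set if the key does not already exist; EX: expire after 600 seconds
  acquired = redis.set(lock_key_name, 1, nx: true, ex: 600)
  return unless acquired

  begin
    report = WeeklyReport.build(user, product, year, week)
    report.save
  ensure
    redis.del lock_key_name
  end
end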
I am not sure I understood your scenario well, but how about looking at this gem:
https://github.com/collectiveidea/delayed_job
So instead of doing:
5.times { WeeklyReportWorker.perform_async('user', 'product') }
You can do:
5.times { WeeklyReportWorker.delay.perform('user', 'product') }
Out of the box, this will make the worker process the second job after the first job, but only if you use the default settings (because by default there is only one worker process).
The gem offers possibilities to:
Put jobs on a queue;
Have different queues for different jobs if that is required;
Have more than one worker processing a queue (for example, you can start 4 workers on a 4-CPU machine for higher efficiency);
Schedule jobs to run at exact times, or after a set amount of time after queueing the job (or, by default, schedule for immediate background execution). A sketch of these options follows the list.
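A sketch of those options; the 'reports' queue name and the 10-minute delay are placeholders, and run_at here assumes ActiveSupport's time helpers are available (as they are in Rails):

# pick a queue and a start time per job
WeeklyReportWorker.delay(queue: 'reports', run_at: 10.minutes.from_now)
                  .perform('user', 'product')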
I hope it can help you as it helped me.

Best way to concurrently check urls (for status i.e. 200,301,404) for multiple urls in database

Here's what I'm trying to accomplish. Let's say I have 100,000 urls stored in a database and I want to check each of these for http status and store that status. I want to be able to do this concurrently in a fairly small amount of time.
I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.
Ideas?
Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.
The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.
Paul Dix, the original author, talked about his design goals on his blog.
This is some sample code I wrote to download archived mail lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to DOS attacks if people start running the code:
#!/usr/bin/env ruby

require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

BASE_URL = ''

url = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc = Nokogiri::HTML(resp.body)

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request = Typhoeus::Request.new(gzip_url.to_s)
  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    File.open("gz/#{gzip_filename}", 'w') do |fo|
      fo.write resp.body
    end
  end
  puts "queuing #{ gzip }"
  hydra.queue(request)
end
hydra.run
Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests your throughput will be better. You'll want to tune the concurrency setting because there is a point where having more concurrent sessions only slows you down and needlessly uses resources.
I give it an 8 out of 10; it's got a great beat and I can dance to it.
EDIT:
When checking the remote URLs you can use a HEAD request, or a GET with an If-Modified-Since header. Either can give you responses you can use to determine the freshness of your URLs.
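For the original status-checking problem, a sketch of queuing HEAD requests with Hydra and recording the status code; the urls array and the save_status helper are placeholders for your own data and persistence:

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
urls.each do |u|
  request = Typhoeus::Request.new(u, :method => :head, :followlocation => false)
  request.on_complete do |resp|
    save_status(u, resp.code) # hypothetical: write 200/301/404/... to the DB
  end
  hydra.queue(request)
end
hydra.run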
I haven't done anything multithreaded in Ruby, only in Java, but it seems pretty straightforward: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm
From what you described, you don't need any queue and workers (well, I'm sure you can do it that way too, but I doubt you'll get much benefit). Just partition your urls between several threads, and let each thread work through its chunk and update the database with the results. E.g., create 100 threads, and give each thread a range of 1000 database rows to process (see the sketch below).
You could even just create 100 separate processes and give them rows as arguments, if you'd rather deal with processes than threads.
To get the URL status, I think you do an HTTP HEAD request, which I guess is http://apidock.com/ruby/Net/HTTP/request_head in ruby.
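A sketch of that partitioning with net/http from the standard library; the chunk size of 1000 follows the suggestion above and persistence is left as a stub:

require 'net/http'
require 'uri'

threads = urls.each_slice(1000).map do |chunk|
  Thread.new do
    chunk.each do |raw|
      uri = URI(raw)
      Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
        res = http.head(uri.path.empty? ? '/' : uri.path)
        # store res.code for this url (persistence is up to you)
      end
    end
  end
end
threads.each(&:join)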
The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.
require 'work_queue'
require 'net/http'

wq = WorkQueue.new 10
urls.each do |url|
  wq.enqueue_b do
    response = Net::HTTP.get_response(URI(url))
    puts response.code
  end
end
wq.join

EventMachine - how can you tell if you're falling behind?

I'm looking into using the EventMachine powered twitter-stream rubygem to track and capture tweets. I'm kind of new to the whole evented programming thing. How can I tell if whatever processing I'm doing in my event loop is causing me to fall behind? Is there an easy way to check?
You can determine the latency by using a periodic timer and printing out the elapsed time. If you're using a timer of 1 second, you should see about 1 second elapsed; if it's greater, you know by how much you're slowing down the reactor.
@last = Time.now.to_f
EM.add_periodic_timer(1) do
  puts "LATENCY: #{Time.now.to_f - @last}"
  @last = Time.now.to_f
end
EventMachine has an EventMachine::Queue#size method that lets you peek at the current queue and get an idea of how big it is.
You could add a periodic timer and, in that event, get the size of the queue and print it.
If the number is not getting smaller, you are at parity; if it's going up, you are falling behind.
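A sketch, assuming your work flows through a single EM::Queue you create yourself:

queue = EM::Queue.new

EM.add_periodic_timer(5) do
  # a growing number means producers outpace consumers: you are falling behind
  puts "backlog: #{queue.size}"
end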
