Dynamically loading new jobs in a SidekiqStatus container to monitor completion - ruby

I built a small web crawler implemented as two Sidekiq workers: Crawler and Parsing. The Crawler worker looks for links, while the Parsing worker reads the page body.
I want to trigger an alert when the crawling/parsing of all pages is complete. Monitoring only the Crawler job is not enough, since it may have finished while several Parser jobs are still running.
Looking at the sidekiq-status gem, it seems that I cannot dynamically add new jobs to the container for monitoring. E.g. it would be nice to have an "add" method in the following context:
@container = SidekiqStatus::Container.new
# ... for each page url found:
jid = ParserWorker.perform_async(page_url)
@container.add(jid)
The closest to this is to use "SidekiqStatus::Container.load" or "SidekiqStatus::Container.load_multi". However, it is not possible to add new jobs to the container after the fact.
One solution would be to create as many SidekiqStatus::Container instances as there are Parser jobs and check that all of them have status == "finished", but I wonder if a more elegant solution exists using these tools.
Any help is appreciated.

You are describing Sidekiq Pro's Batches feature exactly. You can spend a lot of time or some money to solve your problem.
https://github.com/mperham/sidekiq/wiki/Batches

OK, here's a simple solution. Using the sidekiq-status gem, the Crawler worker keeps track of the job IDs of the Parser jobs and waits while any Parser job is still busy (using SidekiqStatus::Container instances to check job status).
def perform
  @jids = [] # job IDs of the Parser jobs started by this crawl
  # for each page....
  @jids << ParserWorker.perform_async(page_url)
  # end
  # crawler finished, parsers may still be running
  while parsers_busy?
    sleep 5 # wait 5 secs between each check
  end
  # all parsers complete, trigger notification...
end
def parsers_busy?
  status_containers = SidekiqStatus::Container.load_multi(@jids)
  # busy as long as at least one parser is still queued or running
  status_containers.any? do |container|
    container.status == 'waiting' || container.status == 'working'
  end
end
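One caveat with a bare while/sleep loop is that it never gives up if a Parser job dies without its status ever leaving 'waiting' or 'working'. A minimal sketch of a bounded wait in plain Ruby follows; `wait_until` and its `timeout:` and `interval:` parameters are hypothetical names, not part of sidekiq-status.

```ruby
# Hypothetical helper (not part of any gem): polls the block until it
# returns a truthy value or the timeout (in seconds) elapses.
# Returns true if the condition was met, false if we gave up.
def wait_until(timeout:, interval: 5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Possible use inside the Crawler's perform, in place of the while loop:
#   finished = wait_until(timeout: 3600) { !parsers_busy? }
#   # notify on success, report a timeout otherwise
```

The monotonic clock is used so that a system clock adjustment cannot shorten or extend the wait.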

Related

sidekiq - runaway FIFO pipes created with large job

We are using Sidekiq to process a number of backend jobs. One in particular is used very heavily. All I can really say about it is that it sends emails. It doesn't do the email creation (that's a separate job), it just sends them. We spin up a new worker for each email that needs to be sent.
We are trying to upgrade to Ruby 3 and having problems, though. Ruby 2.6.8 has no issues; in 3 (as well as 2.7.3 IIRC), if there is a large number of queued workers, it will get through maybe 20K of them, then it will start hemorrhaging FIFO pipes, on the order of 300-1000 every 5 seconds or so. Eventually it hits the ulimit on the system (currently set at 64K) and all sockets/connections fail due to insufficient resources.
In trying to debug this issue I did a run with 90% of what the email worker does entirely commented out, so it does basically nothing except make a couple database queries and do some string templating. I thought I was getting somewhere with that approach, as one run (of 50K+ emails) succeeded without the pipe explosion. However, the next run (identical parameters) did wind up with the runaway pipes.
Profiling with rbspy and ruby-prof did not help much, as they primarily focus on the Sidekiq infrastructure, not the workers themselves.
Looking through our code, I did see that nothing we wrote is ever using IO.* (e.g. IO.popen, IO.select, etc), so I don't see what could be causing the FIFO pipes.
I did see https://github.com/mperham/sidekiq/wiki/Batches#huge-batches, which is not necessarily what we're doing. If you look at the code snippet below, we're basically creating one large batch. I'm not sure whether pushing jobs in bulk as per the link will help with the problem we're having, but I'm about to give it a try once I rework things a bit.
No matter what I do I can't seem to figure out the following:
What is making these pipes? Why are they being created?
What is the condition by which the pipes start getting made exponentially? There are two FIFO pipes that open when we start Sidekiq, but until enough work has been done, we don't see more than 2-6 pipes open generally.
Any advice is appreciated, even along the lines of where to look next, as I'm a bit stumped.
Initializer:
require_relative 'logger'
require_relative 'configuration'
require 'sidekiq-pro'
require "sidekiq-ent"
module Proprietary
  unless const_defined?(:ENVIRONMENT)
    ENVIRONMENT = ENV['RACK_ENV'] || ENV['RAILS_ENV'] || 'development'
  end
  # Sidekiq.client_middleware.add Sidekiq::Middleware::Client::Batch
  REDIS_URL = if ENV["REDIS_URL"].present?
    ENV["REDIS_URL"]
  else
    "redis://#{ENV["REDIS_SERVER"]}:#{ENV["REDIS_PORT"]}"
  end
  METRICS = Statsd.new "10.0.9.215", 8125
  Sidekiq::Enterprise.unique! unless Proprietary::ENVIRONMENT == "test"
  Sidekiq.configure_server do |config|
    # require 'sidekiq/pro/reliable_fetch'
    config.average_scheduled_poll_interval = 2
    config.redis = {
      namespace: Proprietary.config.SIDEKIQ_NAMESPACE,
      url: Proprietary::REDIS_URL
    }
    config.server_middleware do |chain|
      require 'sidekiq/middleware/server/statsd'
      chain.add Sidekiq::Middleware::Server::Statsd, :client => METRICS
    end
    config.error_handlers << Proc.new do |ex,ctx_hash|
      Proprietary.report_exception(ex, "Sidekiq", ctx_hash)
    end
    config.super_fetch!
    config.reliable_scheduler!
  end
  Sidekiq.configure_client do |config|
    config.redis = {
      namespace: Proprietary.config.SIDEKIQ_NAMESPACE,
      url: Proprietary::REDIS_URL,
      size: 15,
      network_timeout: 5
    }
  end
end
Code snippet (sanitized)
  def add_targets_to_batch
    @target_count = targets.count
    queue_counter = 0
    batch.jobs do
      targets.shuffle.each do |target|
        send(campaign_target)
        queue_counter += 1
      end
    end
  end
  def send(campaign_target)
    TargetEmailWorker.perform_async(target[:id],
                                    guid,
                                    is_draft ? target[:email_address] : nil)
    begin
      Target.where(id: target[:id]).update(send_at: Time.now.utc)
    rescue Exception => ex
      Proprietary.report_exception(ex, self.class.name, { target_id: target[:id], guid: guid })
    end
  end
end
First I tried auditing our external connections for connection pooling, etc. That did not help the issue. Eventually I got to the point where I disabled all external connections and let the job run doing virtually nothing outside of a database query and some logging. This allowed one run to complete without issue, but on the second one, the FIFO pipes still grew exponentially after a certain (variable) amount of work was done.
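As one idea for where to look next, it may help to log from inside the worker how many pipe handles the Ruby process itself holds, and see which jobs make that number jump. A rough sketch using only core Ruby follows; `open_pipe_count` is a hypothetical helper name. It only sees pipes that are wrapped in Ruby IO objects, so descriptors opened by the VM or by C extensions will not show up (on Linux, listing /proc/<pid>/fd covers those).

```ruby
# Hypothetical diagnostic helper: counts live Ruby IO objects that are
# backed by a pipe/FIFO in the current process.
def open_pipe_count
  count = 0
  ObjectSpace.each_object(IO) do |io|
    begin
      count += 1 if !io.closed? && io.stat.pipe?
    rescue IOError, SystemCallError
      # uninitialized or concurrently closed streams: skip them
    end
  end
  count
end

# Possible use at the end of a job (logger is whatever the app already uses):
#   logger.info("pipes held by Ruby IO objects: #{open_pipe_count}")
```

If this number stays flat while the FIFO count reported by the OS keeps climbing, the pipes are probably being created below the Ruby IO layer, which would narrow the search.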

Can't seem to run a process in background of Sinatra app

I'm trying to display a number from an api, but I want my page to load faster. So, I'd like to get the number from the api every 5 minutes, and just load that number to my page. This is what I have.
get '/' do
  x = Numbersapi.new
  @number = x.number
  erb :home
end
This works fine, but getting that number from the api takes a while so that means my page takes a while to load. I want to look up that number ahead of time and then every 5 minutes. I've tried using threads and processes, but I can't seem to figure it out. I'm still pretty new to programming.
Here's a pretty simple way to get data in a separate thread. Somewhere outside of the controller action, fire off the async loop:
CACHE = {}
numbers_api = Numbersapi.new
Thread.new do
  loop do
    CACHE[:number] = numbers_api.number
    sleep 300 # 5 minutes
  end
end
Then in your controller action, you can simply refer to CACHE[:number], and you'll get the latest value.
However, if you're deploying this, you should use a gem like Resque or Sidekiq; it will track failures and is probably better optimized.
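If you stay with a plain thread, keep in mind that an exception raised by the API call inside the thread ends the thread silently, and the value then stops updating. A slightly more defensive sketch follows; `PeriodicValue` and its `interval:` parameter are hypothetical names, and the block passed in stands in for the call to `Numbersapi.new.number`.

```ruby
# Hypothetical helper: refreshes a value in a background thread and keeps
# serving the last good value if a refresh fails.
class PeriodicValue
  def initialize(interval:, &fetcher)
    @interval = interval
    @fetcher = fetcher
    @mutex = Mutex.new
    @value = nil
  end

  # Starts the refresh loop; returns the Thread so callers can stop it.
  def start
    Thread.new do
      loop do
        begin
          fresh = @fetcher.call
          @mutex.synchronize { @value = fresh }
        rescue StandardError => e
          warn "refresh failed: #{e.message}" # keep the previous value
        end
        sleep @interval
      end
    end
  end

  # Latest successfully fetched value, or nil before the first fetch.
  def value
    @mutex.synchronize { @value }
  end
end
```

In the app this could be set up once at boot, e.g. `NUMBER = PeriodicValue.new(interval: 300) { Numbersapi.new.number }` followed by `NUMBER.start`, with the route reading `NUMBER.value`. The first request may still see nil if it arrives before the first fetch completes.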

Can you determine the number of workers running in your application

In order to properly scale our Sidekiq workers to the size of our database pool, we came up with a little formula in our configuration:
sidekiq.rb
# defined first, because the configure_server block runs as soon as it is reached
def workers
  # ... the number of workers configured for our project ...
  (ENV['HEROKU_WORKERS'] || 1).to_i
end
Sidekiq.configure_server do |config|
  config.options[:concurrency] = ((ENV['DB_POOL'] || 5).to_i - 1) / workers
end
We're setting HEROKU_WORKERS by hand, but it would be sweet if there was a way to interrogate the Heroku API from within the application.
Modulo all the things that can happen (workers going up or down, changing the number of workers, etc.), this seems to get us out of the initial problem, where our workers would consume all of the database pool connections and then start crashing on startup.
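To make the arithmetic concrete, here is the same formula as a small function. This is only a sketch: `sidekiq_concurrency` is a hypothetical name, and the floor of 1 is an added assumption so that a small pool spread over many workers never yields a concurrency of 0.

```ruby
# Hypothetical helper mirroring the formula above: hold one connection back,
# then split the rest evenly across the worker processes.
def sidekiq_concurrency(db_pool, workers)
  per_worker = (db_pool - 1) / workers # integer division rounds down
  [per_worker, 1].max
end

sidekiq_concurrency(20, 4) # => 4, since (20 - 1) / 4 = 19 / 4
sidekiq_concurrency(5, 1)  # => 4
sidekiq_concurrency(5, 8)  # => 1 (the raw formula would give 0)
```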
The heroku-api gem should provide you this.
https://github.com/heroku/heroku.rb
You should find your API key here: https://dashboard.heroku.com/account
require 'heroku-api'
heroku = Heroku::API.new(api_key: API_KEY)
Total number of current processes:
heroku.get_ps('heroku-app-name').body.count
(You should be able to parse this to get total number of workers... or a count of a specific kind of worker, if you have different kinds defined in your Procfile/Heroku app)
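A sketch of that parsing step follows. It assumes each entry in the response body is a hash with a "process" field such as "web.1" or "worker.2"; that shape is an assumption here, so check it against a real response. `count_processes` is a hypothetical helper, and the sample data is made up for illustration.

```ruby
# Hypothetical helper: counts all processes, or only those of one type
# (the part of the process name before the dot).
def count_processes(ps_body, type = nil)
  return ps_body.size if type.nil?
  ps_body.count { |ps| ps['process'].to_s.split('.').first == type }
end

# Sample data made up for illustration, not real API output.
sample = [
  { 'process' => 'web.1' },
  { 'process' => 'worker.1' },
  { 'process' => 'worker.2' }
]
count_processes(sample)           # => 3
count_processes(sample, 'worker') # => 2
```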

Ruby and Rails Async

I need to perform a long-running operation in Ruby/Rails asynchronously.
Googling around, one of the options I found is Sidekiq.
class WeeklyReportWorker
  include Sidekiq::Worker
  def perform(user, product, year = Time.now.year, week = Date.today.cweek)
    report = WeeklyReport.build(user, product, year, week)
    report.save
  end
end
# call WeeklyReportWorker.perform_async('user', 'product')
Everything works great! But there is a problem.
If I keep calling this async method every few seconds, but the heavy operation itself takes about a minute, things won't work.
Let me put it in an example.
5.times { WeeklyReportWorker.perform_async('user', 'product') }
Now my heavy operation will be performed 5 times. Ideally it should be performed only once or twice, depending on whether execution of the first operation started before the 5th async call was made.
Do you have any tips on how to solve this?
Here's a naive approach. I'm a resque user, maybe sidekiq has something better to offer.
def perform(user, product, year = Time.now.year, week = Date.today.cweek)
  # first, make a name for lock key. For example, include all arguments
  # there, so that another perform with the same arguments won't do any work
  # while the first one is still running
  lock_key_name = make_lock_key_name(user, product, year, week)
  Sidekiq.redis do |redis| # sidekiq uses redis, let us leverage that
    res = redis.incr lock_key_name
    # protection from race condition. Since incr is atomic, the very first
    # one will set value to 1. All subsequent incrs will return greater values.
    # if incr returned not 1, then another copy of this operation is already
    # running, so we quit without touching the lock key.
    return if res != 1
    begin
      # finally, perform your business logic here
      report = WeeklyReport.build(user, product, year, week)
      report.save
    ensure
      # only the job that took the lock drops it, so that operation may run again.
      redis.del lock_key_name
    end
  end
end
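One weakness of an INCR-based lock is that the key stays around forever if the process is killed before the cleanup runs. A common variation is to create the key only if it is absent and give it an expiry. The sketch below assumes a redis-rb style client whose `set(key, value, nx: true, ex: seconds)` returns a truthy value only when the key was created; `with_redis_lock` and its `ttl:` parameter are hypothetical names.

```ruby
# Hypothetical helper: runs the block only if the lock key could be created.
# Returns the block's result, or nil when someone else holds the lock.
def with_redis_lock(redis, key, ttl:)
  acquired = redis.set(key, '1', nx: true, ex: ttl) # SET key 1 NX EX ttl
  return nil unless acquired
  begin
    yield
  ensure
    redis.del(key) # release early; the expiry is only a safety net
  end
end

# Possible use inside perform:
#   Sidekiq.redis do |redis|
#     with_redis_lock(redis, lock_key_name, ttl: 120) do
#       WeeklyReport.build(user, product, year, week).save
#     end
#   end
```

The ttl should be comfortably longer than the job normally takes. If the job outlives it, the lock expires and a second copy can start, which is the trade-off for not leaving a stale lock behind.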
I am not sure I understood your scenario well, but how about looking at this gem:
https://github.com/collectiveidea/delayed_job
So instead of doing:
5.times { WeeklyReportWorker.perform_async('user', 'product') }
You can do:
5.times { WeeklyReportWorker.new.delay.perform('user', 'product') }
Out of the box, the worker will process the second job only after the first one has finished, but only with the default settings (by default there is a single worker process).
The gem offers possibilities to:
Put jobs on a queue;
Have different queues for different jobs if that is required;
Have more than one worker process a queue (for example, you can start 4 workers on a 4-CPU machine for higher efficiency);
Schedule jobs to run at exact times, or a set amount of time after queueing the job. (Or, by default, schedule for immediate background execution).
I hope it helps you as it helped me.

Best way to concurrently check urls (for status i.e. 200,301,404) for multiple urls in database

Here's what I'm trying to accomplish. Let's say I have 100,000 urls stored in a database and I want to check each of these for http status and store that status. I want to be able to do this concurrently in a fairly small amount of time.
I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.
Ideas?
Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.
The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.
Paul Dix, the original author, talked about his design goals on his blog.
This is some sample code I wrote to download archived mail lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to DOS attacks if people start running the code:
#!/usr/bin/env ruby
require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'
BASE_URL = ''
url = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc = Nokogiri::HTML(resp.body)
hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request = Typhoeus::Request.new(gzip_url.to_s)
  request.on_complete do |response|
    gzip_filename = response.request.url.split('/').last
    puts "writing #{gzip_filename}"
    # binary mode, since .gz files are not text
    File.open("gz/#{gzip_filename}", 'wb') do |fo|
      fo.write response.body
    end
  end
  puts "queuing #{ gzip }"
  hydra.queue(request)
end
hydra.run
Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests your throughput will be better. You'll want to experiment with the concurrency setting, because there is a point where having more concurrent sessions only slows you down and needlessly uses resources.
I give it an 8 out of 10; it's got a great beat and I can dance to it.
EDIT:
When checking the remote URLs you can use a HEAD request, or a GET with the If-Modified-Since header. They can give you responses you can use to determine the freshness of your URLs.
I haven't done anything multithreaded in Ruby, only in Java, but it seems pretty straightforward: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm
From what you described, you don't need any queue and workers (well, I'm sure you can do it that way too, but I doubt you'll get much benefit). Just partition your urls between several threads, and let each thread do each chunk and update the database with the results. E.g., create 100 threads, and give each thread a range of 1000 database rows to process.
You could even just create 100 separate processes and give them rows as arguments, if you'd rather deal with processes than threads.
To get the URL status, I think you do an HTTP HEAD request, which I guess is http://apidock.com/ruby/Net/HTTP/request_head in ruby.
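A minimal sketch of that idea using only the standard library follows. It uses a shared queue instead of fixed row ranges, so threads that hit fast servers simply pick up more work. `check_statuses`, `HEAD_CHECKER` and the `checker:` parameter are hypothetical names; the default checker issues a HEAD request with Net::HTTP, does not follow redirects, and uses 5-second timeouts chosen arbitrarily.

```ruby
require 'net/http'
require 'uri'

# Default checker: HEAD request; returns the status code as an Integer,
# or the exception class name if the request fails.
HEAD_CHECKER = lambda do |url|
  begin
    uri = URI(url)
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https',
                    open_timeout: 5, read_timeout: 5) do |http|
      http.head(uri.request_uri).code.to_i
    end
  rescue StandardError => e
    e.class.name
  end
end

# Hypothetical helper: checks all urls with a fixed number of threads and
# returns a Hash of url => status.
def check_statuses(urls, concurrency: 10, checker: HEAD_CHECKER)
  queue = Queue.new
  urls.each { |u| queue << u }
  queue.close # pop returns nil once the queue is drained

  results = {}
  mutex = Mutex.new
  threads = Array.new(concurrency) do
    Thread.new do
      while (url = queue.pop)
        status = checker.call(url)
        mutex.synchronize { results[url] = status }
      end
    end
  end
  threads.each(&:join)
  results
end
```

With 100,000 rows you would probably feed this in batches (say a few thousand URLs at a time) and write each batch of results back to the database before loading the next one.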
The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.
require 'net/http'
require 'work_queue'
wq = WorkQueue.new 10 # pool of up to 10 worker threads
urls.each do |url|
  wq.enqueue_b do
    response = Net::HTTP.get_response(URI(url))
    puts response.code
  end
end
wq.join

Resources