Is there a way to pass data to jobs with dashing? - ruby

I am working on a dashboard to display a few data points on a specific server. Obtaining this data requires non-trivial resources, so ideally it would only happen if there is a widget that is actively listening to it. Is this doable with dashing ? Example:
User surfs to /server-info?server_id=1234
This returns an HTML page with data-id="server-info-1234" AND somehow alerts the jobs/server-info.rb to start collecting data on "1234". Ideally once the tab is closed, the job is notified.
In essence, is there a way for the web views to send data/events to the jobs ?

No not really, however it may be possible to solve this another way.
Inside your job, once you require 'sinatra' you can see how many active connections are running like this: Sinatra::Application.settings.connections.
So, do something like this:
require 'sinatra'
current = 0
SCHEDULER.every '30s' do
if Sinatra::Application.settings.connections.length > 0
current = doMyCostlyOperation()
end
send_event('value', { current: current })
end

Related

sidekiq - runaway FIFO pipes created with large job

We are using Sidekiq to process a number of backend jobs. One in particular is used very heavily. All I can really say about it is that it sends emails. It doesn't do the email creation (that's a separate job), it just sends them. We spin up a new worker for each email that needs to be sent.
We are trying to upgrade to Ruby 3 and having problems, though. Ruby 2.6.8 has no issues; in 3 (as well as 2.7.3 IIRC), if there is a large number of queued workers, it will get through maybe 20K of them, then it will start hemorrhaging FIFO pipes, on the order of 300-1000 ever 5 seconds or so. Eventually it gets to the ulimit on the system (currently set at 64K) and all sockets/connections fail due to insufficient resources.
In trying to debug this issue I did a run with 90% of what the email worker does entirely commented out, so it does basically nothing except make a couple database queries and do some string templating. I thought I was getting somewhere with that approach, as one run (of 50K+ emails) succeeded without the pipe explosion. However, the next run (identical parameters) did wind up with the runaway pipes.
Profiling with rbspy and ruby-prof did not help much, as they primarily focus on the Sidekiq infrastructure, not the workers themselves.
Looking through our code, I did see that nothing we wrote is ever using IO.* (e.g. IO.popen, IO.select, etc), so I don't see what could be causing the FIFO pipes.
I did see https://github.com/mperham/sidekiq/wiki/Batches#huge-batches, which is not necessarily what we're doing. If you look at the code snippet below, we're basically creating one large batch. I'm not sure whether pushing jobs in bulk as per the link will help with the problem we're having, but I'm about to give it a try once I rework things a bit.
No matter what I do I can't seem to figure out the following:
What is making these pipes? Why are they being created?
What is the condition by which the pipes start getting made exponentially? There are two FIFO pipes that open when we start Sidekiq, but until enough work has been done, we don't see more than 2-6 pipes open generally.
Any advice is appreciated, even along the lines of where to look next, as I'm a bit stumped.
Initializer:
require_relative 'logger'
require_relative 'configuration'
require 'sidekiq-pro'
require "sidekiq-ent"
module Proprietary
unless const_defined?(:ENVIRONMENT)
ENVIRONMENT = ENV['RACK_ENV'] || ENV['RAILS_ENV'] || 'development'
end
# Sidekiq.client_middleware.add Sidekiq::Middleware::Client::Batch
REDIS_URL = if ENV["REDIS_URL"].present?
ENV["REDIS_URL"]
else
"redis://#{ENV["REDIS_SERVER"]}:#{ENV["REDIS_PORT"]}"
end
METRICS = Statsd.new "10.0.9.215", 8125
Sidekiq::Enterprise.unique! unless Proprietary::ENVIRONMENT == "test"
Sidekiq.configure_server do |config|
# require 'sidekiq/pro/reliable_fetch'
config.average_scheduled_poll_interval = 2
config.redis = {
namespace: Proprietary.config.SIDEKIQ_NAMESPACE,
url: Proprietary::REDIS_URL
}
config.server_middleware do |chain|
require 'sidekiq/middleware/server/statsd'
chain.add Sidekiq::Middleware::Server::Statsd, :client => METRICS
end
config.error_handlers << Proc.new do |ex,ctx_hash|
Proprietary.report_exception(ex, "Sidekiq", ctx_hash)
end
config.super_fetch!
config.reliable_scheduler!
end
Sidekiq.configure_client do |config|
config.redis = {
namespace: Proprietary.config.SIDEKIQ_NAMESPACE,
url: Proprietary::REDIS_URL,
size: 15,
network_timeout: 5
}
end
end
Code snippet (sanitized)
def add_targets_to_batch
#target_count = targets.count
queue_counter = 0
batch.jobs do
targets.shuffle.each do |target|
send(campaign_target)
queue_counter += 1
end
end
end
def send(campaign_target)
TargetEmailWorker.perform_async(target[:id],
guid,
is_draft ? target[:email_address] : nil)
begin
Target.where(id: target[:id]).update(send_at: Time.now.utc)
rescue Exception => ex
Proprietary.report_exception(ex, self.class.name, { target_id: target[:id], guid: guid })
end
end
end
First I tried auditing our external connections for connection pooling, etc. That did not help the issue. Eventually I got to the point where I disabled all external connections and let the job run doing virtually nothing outside of a database query and some logging. This allowed one run to complete without issue, but on the second one, the FIFO pipes still grew exponentially after a certain (variable) amount of work was done.

ActiveRecord Postgres database not locking - getting race conditions

I'm struggling with locking a PostgreSQL table I'm working on. Ideally I want to lock the entire table, but individual rows will do as long as they actually work.
I have several concurrent ruby scripts that all query a central jobs database on AWS (via a DatabaseAccessor class), find a job that hasn't yet been started, change the status to started and carry it out. The problem is, since these are all running at once, they'll typically all find the same unstarted job at once, and begin carrying it out, wasting time and muddying the results.
I've tried a bunch of things, .lock, .transaction, the fatalistic gem but they don't seem to be working, at least, not in pry.
My code is as follows:
class DatabaseAccessor
require 'pg'
require 'pry'
require 'active_record'
class Jobs < ActiveRecord::Base
enum status: [ :unstarted, :started, :slow, :completed]
end
def initialize(db_credentials)
ActiveRecord::Base.establish_connection(
adapter: db_credentials[:adapter],
database: db_credentials[:database],
username: db_credentials[:username],
password: db_credentials[:password],
host: db_credentials[:host]
)
end
def find_unstarted_job
job = Jobs.where(status: 0).limit(1)
job.started!
job
end
end
Does anyone have any suggestions?
EDIT: It seems that LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE; is the way to do this - however, I'm struggling with then returning the results of this after updating. RETURNING * will return the results after an update, but not inside a transaction.
SOLVED!
So the key here is locking in Postgres. There are a few different table-level locks, detailed here.
There are three factors here in making a decision:
Reads aren't thread safe. Two threads reading the same record will result in that job being run multiple times at once.
Records are only updated once (to be marked as completed) and created, other than the initial read and update to being started. Scripts that create new records will not read the table.
Reading varies in frequency. Waiting for an unlock is non-critical.
Given these factors, if there were a read-lock that still allowed writes, this would be acceptable, however, there isn't, so ACCESS EXCLUSIVE is our best option.
Given this, how do we deal with locking? A hunt through the ActiveRecord documentation gives no mention of it.
Thankfully, other methods to deal with PostgreSQL exist, namely the ruby-pg gem. A bit of a play with SQL later, and a test of locking, and I get the following method:
def converter
result_hash = {}
conn = PG::Connection.open(:dbname => 'my_db')
conn.exec("BEGIN WORK;
LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE;")
conn.exec("UPDATE jobs SET status = 1 WHERE id =
(SELECT id FROM jobs WHERE status = 0 ORDER BY ID LIMIT 1)
RETURNING *;") do |result|
result.each { |row| result_hash = row }
end
conn.exec("COMMIT WORK;")
result_hash.transform_keys!(&:to_sym)
end
This will result in:
An output of an empty hash if there are no jobs with a status of 0
An output of a symbolized hash if one is found and updated
Sleeping if the database is currently locked, before returning the above once unlocked.
The table will remain locked until the COMMIT WORK statement.
As an aside, I wish there was a cleaner way to convert the result to a hash. If anyone has any suggestions, please let me know in the comments! :)

Can't seem to run a process in background of Sinatra app

I'm trying to display a number from an api, but I want my page to load faster. So, I'd like to get the number from the api every 5 minutes, and just load that number to my page. This is what I have.
get '/' do
x = Numbersapi.new
#number = x.number
:erb home
end
This works fine, but getting that number from the api takes a while so that means my page takes a while to load. I want to look up that number ahead of time and then every 5 minutes. I've tried using threads and processes, but I can't seem to figure it out. I'm still pretty new to programming.
Here's a pretty simple way to get data in a separate thread. Somewhere outside of the controller action, fire off the async loop:
Data = {}
numbers_api = Numbersapi.new
Thread.new do
Data[:number] = numbers_api.number
sleep 300 # 5 minutes
end
Then in your controller action, you can simply refer to the Data[:number], and you'll get the latest value.
However if you're deploying this you should use a gem like Resque or Sidekiq; it will track failures and is probably optimized more

Dynamically loading new jobs in a SidekiqStatus container to monitor completion

I built a small web crawler implemented in two Sidekiq workers: Crawler and Parsing. The Crawler worker will seek for links while Parsing worker will read the page body.
I want to trigger an alert when the crawling/parsing of all pages is complete. Monitoring only the Crawler job is not the best solution since it may have finished but there might be several Parser jobs running.
Having a look at sidekiq-status gem it seems that I cannot dynamically add new jobs to the container for monitoring. E.g. it would be nice to have a "add" method in the following context:
#container = SidekiqStatus::Container.new
# ... for each page url found:
jid = ParserWorker.perform_async(page_url)
#container.add(jid)
The closest to this is to use "SidekiqStatus::Container.load" or "SidekiqStatus::Container.load_multi" however, it is not possible to add new jobs in the container a posteriori.
One solution would be to create as many SidekiqStatus::Container instances as the number of ParserJobs and check if all of them have status == "finished", but I wonder if a more elegant solution exists using these tools.
Any help is appreciated.
You are describing Sidekiq Pro's Batches feature exactly. You can spend a lot of time or some money to solve your problem.
https://github.com/mperham/sidekiq/wiki/Batches
OK, here's a simple solution. Using the sidekiq-status gem, the Crawler worker keeps track of the jobs IDs for the Parser jobs and halts if any Parser job is still busy (using the SidekiqStatus::Container instance to check job status).
def perform()
# for each page....
#jids << ParserWorker.perform_async(page_url)
# end
# crawler finished, parsers may still be running
while parsers_busy?
sleep 5 # wait 5 secs between each check
end
# all parsers complete, trigger notification...
end
def parsers_busy?
status_containers = SidekiqStatus::Container.load_multi(#jids)
for container in status_containers
if container.status == 'waiting' || container.status == 'working'
return true
end
end
return false
end

Best way to concurrently check urls (for status i.e. 200,301,404) for multiple urls in database

Here's what I'm trying to accomplish. Let's say I have 100,000 urls stored in a database and I want to check each of these for http status and store that status. I want to be able to do this concurrently in a fairly small amount of time.
I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.
Ideas?
Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.
The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.
Paul Dix, the original author, talked about his design goals on his blog.
This is some sample code I wrote to download archived mail lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to DOS attacks if people start running the code:
#!/usr/bin/env ruby
require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'
BASE_URL = ''
url = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc = Nokogiri::HTML(resp.body)
hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
gzip_url = url.join(gzip)
request = Typhoeus::Request.new(gzip_url.to_s)
request.on_complete do |resp|
gzip_filename = resp.request.url.split('/').last
puts "writing #{gzip_filename}"
File.open("gz/#{gzip_filename}", 'w') do |fo|
fo.write resp.body
end
end
puts "queuing #{ gzip }"
hydra.queue(request)
end
hydra.run
Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests your throughput will be better. You'll want to mess with the concurrency setting because there is a point where having more concurrent sessions only slow you down and needlessly use resources.
I give it a 8 out of 10; It's got a great beat and I can dance to it.
EDIT:
When checking the remove URLs you can use a HEAD request, or a GET with the If-Modified-Since. They can give you responses you can use to determine the freshness of your URLs.
I haven't done anything multithreaded in Ruby, only in Java, but it seems pretty straightforward: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm
From what you described, you don't need any queue and workers (well, I'm sure you can do it that way too, but I doubt you'll get much benefit). Just partition your urls between several threads, and let each thread do each chunk and update the database with the results. E.g., create 100 threads, and give each thread a range of 1000 database rows to process.
You could even just create 100 separate processes and give them rows as arguments, if you'd rather deal with processes than threads.
To get the URL status, I think you do an HTTP HEAD request, which I guess is http://apidock.com/ruby/Net/HTTP/request_head in ruby.
The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.
wq = WorkQueue.new 10
urls.each do |url|
wq.enqueue_b do
response = Net::HTTP.get_response(uri)
puts response.code
end
end
wq.join

Resources