De-dupe Sidekiq queues - ruby

How could I de-dupe all Sidekiq queues, i.e. ensure each job in a queue has a unique worker class and arguments?
(This arises because, for example, an object is saved twice, triggering a new job each time, but we only want it to be processed once. So I'm looking to periodically de-dupe queues.)

You can use the sidekiq-unique-jobs gem - it looks like it does exactly what you need.
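For instance (a sketch; the option names vary across gem versions - older releases used unique: true, newer ones use lock options such as lock: :until_executed - so check the gem's README; the worker name is made up):
# Gemfile
gem 'sidekiq-unique-jobs'

# app/workers/process_object_worker.rb (hypothetical worker)
class ProcessObjectWorker
  include Sidekiq::Worker
  # reject a new job if an identical one (same class and args) is already queued
  sidekiq_options unique: true

  def perform(object_id)
    # processing happens here
  end
end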
Added later:
Here is a basic implementation of what you are asking for. It will not be fast, but it should be OK for small queues. I also hit a problem with repacking JSON - in my environment it was necessary to re-encode the JSON exactly the same way Sidekiq had encoded it.
# for proper JSON packing (I had an issue with it while testing)
require 'bigdecimal'

class BigDecimal
  def as_json(options = nil) #:nodoc:
    if finite?
      self
    else
      NilClass::AS_JSON
    end
  end
end
Sidekiq.redis do |connection|
  # fetch the queued jobs from redis (each entry is a JSON string)
  items_count = connection.llen('queue:background')
  items = connection.lrange('queue:background', 0, items_count - 1)
  # jobs are in JSON - decode them
  items_decoded = items.map{|item| ActiveSupport::JSON.decode(item)}
  # group them by class and arguments
  grouped = items_decoded.group_by{|item| [item['class'], item['args']]}
  # from each group of duplicates, keep the last job and mark the rest for deletion
  duplicated = grouped.values.delete_if{|mini_list| mini_list.length < 2}
  for_deletion = duplicated.map{|a| a[0...-1]}.flatten
  # re-encode; the JSON must match the stored string exactly for LREM to find it
  for_deletion_packed = for_deletion.map{|item| JSON.generate(item)}
  # removing duplicates one by one (each job's unique jid makes its JSON distinct)
  for_deletion_packed.each do |packed_item|
    connection.lrem('queue:background', 1, packed_item)
  end
end
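To run this periodically, one option (a sketch, assuming a Rails app scheduled via cron or a scheduler gem; the task and helper names are made up) is to wrap the snippet above in a rake task:
# lib/tasks/sidekiq_dedupe.rake (hypothetical file)
namespace :sidekiq do
  desc 'Remove duplicate jobs from a Sidekiq queue'
  task :dedupe, [:queue] => :environment do |_t, args|
    queue = args[:queue] || 'background'
    # hypothetical helper wrapping the Sidekiq.redis block above,
    # parameterized by queue name
    dedupe_queue(queue)
  end
end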

Related

Thread in Parallel gem Ruby

I am using the sidekiq gem for my queue, and I want to process parts of a job in parallel inside the worker.
Here is my worker code:
def perform(disbursement_id)
  # some logic...
  Parallel.each(disbursement.employee_disbursements, in_threads: 2) do |employee|
    amount = amount_format(employee.amount)
    res = unload_company_account(cmp_acc_id, amount.to_s)
    load_employee_account(employee) unless res.empty?
  end
end
Now when I use Parallel.each() without threads it works fine, but when I use Parallel.each(.., in_threads: 3) the queue goes into a busy state.
I'm not sure why in_threads puts my queue into a busy state, and I haven't been able to resolve it.
Try the following to make it work:
Parallel.each(disbursement.employee_disbursements, in_threads: 2) do |employee|
  ActiveRecord::Base.connection_pool.with_connection do
    amount = amount_format(employee.amount)
    res = unload_company_account(cmp_acc_id, amount.to_s)
    load_employee_account(employee) unless res.empty?
  end
end
Also, the issue goes away when you use map instead of each, or pass the preserve_results attribute as true or false. That is a bit of a mystery, because each is simply map with its results discarded:
def each(array, options={}, &block)
  map(array, options.merge(:preserve_results => false), &block)
end
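The likely root cause (an inference; the thread doesn't confirm it): each thread spawned by Parallel checks a connection out of the ActiveRecord pool and never returns it, so the pool runs dry and the worker hangs, which is why the queue stays busy. with_connection returns the connection to the pool when its block exits. A standalone sketch of the pattern, with illustrative connection settings:
require 'active_record'
require 'parallel'

# the pool must cover all Parallel threads plus the calling (worker) thread
ActiveRecord::Base.establish_connection(
  adapter:  'postgresql',
  database: 'app_db',   # illustrative
  pool:     5
)

Parallel.each([1, 2, 3, 4], in_threads: 2) do |id|
  # check a connection out of the pool; it is returned when the block exits
  ActiveRecord::Base.connection_pool.with_connection do |conn|
    conn.execute('SELECT 1')  # placeholder for real per-record work
  end
end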

Ruby Backburner Job Results

I'm setting up Backburner as a work queue, and my job items need to return JSON for the resulting data they create. I'm not sure how to structure this. As a test I've tried doing:
class PrintJob
  include Backburner::Performable

  def self.print(text)
    puts text
    return "results"
  end
end

Backburner.configure do |config|
  config.beanstalk_url = ["beanstalk://127.0.0.1"]
  # etc
end

val = PrintJob.async.print('some cool text')
puts val
and running Backburner.work inside IRB. The puts works but the return value comes back as true instead of "results".
Is there a way to get return values out of async methods? Or should I try a different approach, e.g. having one queue for jobs and another for results? If so, how can I associate the result 'job' with the original work it belongs to?
Note: I'm eventually using Sinatra and not Rails.

How should I handle this use case using EventMachine?

I have an application that reacts to messages sent by clients. One message is reload_credentials, which the application receives any time a new client registers. The handler for this message connects to a PostgreSQL database, queries all the credentials, and stores them in a regular Ruby hash (client_id => client_token).
Some other messages the application may receive are start, stop, and pause, which are used to keep track of session times. My point is that I envision the application functioning in the following way:
client sends a message
message gets queued
queue is being processed
However, for example, I don't want to block the reactor. Furthermore, let's imagine I have a reload_credentials message that's next in the queue. I don't want any other message from the queue to be processed until the credentials are reloaded from the DB. Also, while I am processing a certain message (like waiting for the credentials query to finish), I want to allow other messages to be enqueued.
Could you please guide me towards solving such a problem? I'm thinking I may have to use em-synchrony, but I am not sure.
Use one of the PostgreSQL EM drivers, or EM.defer, so that you won't block the reactor.
When you receive the 'reload_credentials' message, just flip a flag that causes all subsequent messages to be enqueued. Once the 'reload_credentials' has finished, process all messages from the queue. After the queue is empty, flip the flag back so messages are processed as they are received.
EM drivers for PostgreSQL are listed here: https://github.com/eventmachine/eventmachine/wiki/Protocol-Implementations
module Server
  def post_init
    @queue = []
    @loading_credentials = false
  end

  def receive_message(type, data)
    return @queue << [type, data] if @loading_credentials || !@queue.empty?
    return process_msg(type, data) unless :reload_credentials == type
    @loading_credentials = true
    reload_credentials do
      @loading_credentials = false
      process_queue
    end
  end

  def reload_credentials(&when_done)
    EM.defer( proc { query_and_load_credentials }, when_done )
  end

  def process_queue
    while (item = @queue.shift)
      process_msg(*item)
    end
  end

  # lots of other methods
end

EM.start_server(HOST, PORT, Server)
If you want all connections to queue messages whenever any connection receives a 'reload_credentials' message, you'll have to coordinate via the eigenclass, e.g. as sketched below.
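A rough sketch of that coordination (an assumption on my part: module-level state shared by every connection instance, reusing the methods from the module above):
module Server
  class << self
    attr_accessor :loading_credentials   # shared across all connections

    def shared_queue
      @shared_queue ||= []
    end
  end

  def receive_message(type, data)
    return Server.shared_queue << [type, data] if Server.loading_credentials
    return process_msg(type, data) unless :reload_credentials == type
    Server.loading_credentials = true
    reload_credentials do
      Server.loading_credentials = false
      # drain the shared backlog on whichever connection finished the reload
      process_msg(*Server.shared_queue.shift) until Server.shared_queue.empty?
    end
  end
end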
The following is, I presume, something like your current implementation:
class Worker
  def initialize queue
    @queue = queue
    dequeue
  end

  def dequeue
    @queue.pop do |item|
      begin
        work_on item
      ensure
        dequeue
      end
    end
  end

  def work_on item
    case item.type
    when :reload_credentials
      # magic happens here
    else
      # more magic happens here
    end
  end
end

q = EM::Queue.new
workers = Array.new(10) { Worker.new q }
The problem above, if I understand you correctly, is that you don't want workers picking up jobs that arrived after a reload_credentials job until that reload has completed. The following should handle this (additional words of caution at the end).
class Worker
  def initialize queue
    @queue = queue
    dequeue
  end

  def dequeue
    @queue.pop do |item|
      begin
        work_on item
      ensure
        dequeue
      end
    end
  end

  def work_on item
    case item.type
    when :reload_credentials
      # magic happens here
    else
      # more magic happens here
    end
  end
end
class LockingDispatcher
  def initialize channel, queue
    @channel = channel
    @queue = queue
    @backlog = []
    @channel.subscribe method(:dispatch_with_locking)
    @locked = false
  end

  def dispatch_with_locking item
    if locked?
      @backlog << item
    else
      # You probably want to move the specialization here out into a method or
      # block that's passed into the constructor, to make the LockingDispatcher
      # more of a generic processor
      case item.type
      when :reload_credentials
        lock
        deferrable = CredentialReloader.new(item).start
        deferrable.callback { unlock }
        deferrable.errback { unlock }
      else
        dispatch_without_locking item
      end
    end
  end

  def dispatch_without_locking item
    @queue << item
  end

  def locked?
    @locked
  end

  def lock
    @locked = true
  end

  def unlock
    @locked = false
    bl = @backlog.dup
    @backlog.clear
    bl.each { |item| dispatch_with_locking item }
  end
end

channel = EM::Channel.new
queue = EM::Queue.new

dispatcher = LockingDispatcher.new channel, queue
workers = Array.new(10) { Worker.new queue }
So, input to the first system comes in on q, but in this new system it comes in on channel. The queue is still used for work distribution among workers, but it is not populated while a credentials refresh is going on. Unfortunately, as I didn't take more time, I have not generalized the LockingDispatcher, so it remains coupled to the item type and the dispatch code for CredentialReloader. I'll leave that to you.
You should note here that whilst this services what I understand of your original request, it is generally better to relax this kind of requirement. There are several outstanding problems that essentially cannot be eradicated without alterations in that requirement:
The system does not wait for executing jobs to complete before starting credentials jobs
The system will handle bursts of credentials jobs very badly: other items that might be processable won't be.
In the case of a bug in the credentials code, the backlog could fill up RAM and cause failure. A simple timeout might be enough to avoid catastrophic effects, iff the code is abortable and subsequent messages are sufficiently processable to avoid further deadlocks.
It actually sounds like you have some notion of a user id in the system. If you think through your requirements, it's likely that you only need to backlog items that pertain to a user whose credentials are in a refresh state. This is a different problem that involves a different kind of dispatching. Try a hash of locked backlogs for those users, with a callback on credential completion to drain those backlogs into the workers, or some similar arrangement, as sketched below.
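For example, a rough sketch of that per-user variant, reusing CredentialReloader from above and assuming each item responds to user_id:
class PerUserLockingDispatcher
  def initialize channel, queue
    @queue = queue
    @backlogs = {}   # user_id => items held back during that user's refresh
    channel.subscribe method(:dispatch)
  end

  def dispatch item
    if @backlogs.key?(item.user_id)
      # this user's credentials are mid-refresh; hold the item back
      @backlogs[item.user_id] << item
    elsif :reload_credentials == item.type
      @backlogs[item.user_id] = []
      deferrable = CredentialReloader.new(item).start
      deferrable.callback { drain item.user_id }
      deferrable.errback { drain item.user_id }
    else
      @queue << item
    end
  end

  def drain user_id
    items = @backlogs.delete(user_id) || []
    items.each { |item| dispatch item }
  end
end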
Good luck!

ruby: multiple identical or synced instances of mechanize?

As far as I know (I read this elsewhere), Ruby Mechanize is not thread-safe. Thus, to accelerate some gets, I opted to instantiate several independent Mechanize objects and use them in parallel. This seems to work OK.
I would like to make all the instances as similar as possible, ideally sharing 'everything' they know (cookies, etc.).
Is there any way to make deep copies of an already configured Mechanize object? My aim is to configure only one of them and then make clones of it.
For instance, suppose I create a Mechanize object like this (only an example; assume many more attributes are configured):
agent = Mechanize.new { |a| a.read_timeout = 20; a.max_history = 1 }
How can I get copies that don't interfere with each other while running gets?
agent2 = agent.dup # copies are not thread-safe
agent2 = Marshal.load(Marshal.dump(agent)) # throws an error
This appears to work, at least until you change a value for max_history or read_timeout on one agent after cloning:
class Mechanize
  def clone
    Mechanize.new do |a|
      a.cookie_jar = cookie_jar
      a.max_history = max_history
      a.read_timeout = read_timeout
    end
  end
end
Testing:
agent1 = Mechanize.new { |a| a.max_history = 30; a.read_timeout = 30 }
agent2 = agent1.clone
agent2.max_history == 30 # true
agent2.cookie_jar == agent1.cookie_jar # true
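A sketch of how the clones might then be used from separate threads (the URLs are placeholders; note that the shared cookie jar itself is not guaranteed to be thread-safe under concurrent writes):
require 'mechanize'

master = Mechanize.new { |a| a.read_timeout = 20; a.max_history = 1 }
agents = Array.new(4) { master.clone }  # uses the clone patch above

urls = %w[https://example.com/a https://example.com/b
          https://example.com/c https://example.com/d]

# one agent per thread, so no Mechanize instance is shared between threads
threads = agents.zip(urls).map do |agent, url|
  Thread.new { agent.get(url) }
end
pages = threads.map(&:value)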

Sinatra with a persistent variable

My Sinatra app has to parse a ~60 MB XML file. This file hardly ever changes: it is overwritten with a new one by a nightly cron job.
Are there tricks or ways to keep the parsed file in memory, as a variable, so that I can read from it on incoming requests, but not have to parse it over and over for each incoming request?
Some pseudocode to illustrate my problem:
get '/projects/:id' do
  return @nokogiri_object.search("//projects/project[@id=#{params[:id]}]/name/text()")
end

post '/projects/update' do
  if params[:token] == "s3cr3t"
    @nokogiri_object = reparse_the_xml_file
  end
end
What I need to know is how to create such a @nokogiri_object so that it persists while Sinatra runs. Is that possible at all? Or do I need some storage for that?
You could try:
configure do
  @@nokogiri_object = parse_xml
end
Then @@nokogiri_object will be available in your request methods. It's a class variable rather than an instance variable, but it should do what you want.
The proposed solution gives a warning:
warning: class variable access from toplevel
You can use a class method to access the class variable, and the warning will disappear:
require 'sinatra'

class Cache
  @@count = 0

  def self.init()
    @@count = 0
  end

  def self.increment()
    @@count = @@count + 1
  end

  def self.count()
    return @@count
  end
end

configure do
  Cache::init()
end

get '/' do
  if Cache::count() == 0
    Cache::increment()
    "First time"
  else
    Cache::increment()
    "Another time #{Cache::count()}"
  end
end
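Adapted to the question (a sketch; the file path is hypothetical, and the XPath is taken from the pseudocode above):
require 'sinatra'
require 'nokogiri'

class XmlCache
  def self.load(path)
    # parsed once and kept in a class variable for all requests
    @@doc = Nokogiri::XML(File.read(path))
  end

  def self.doc
    @@doc
  end
end

configure do
  XmlCache.load('projects.xml')  # hypothetical path to the nightly file
end

get '/projects/:id' do
  XmlCache.doc.search("//projects/project[@id=#{params[:id]}]/name/text()").to_s
end

post '/projects/update' do
  XmlCache.load('projects.xml') if params[:token] == "s3cr3t"
end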
Two options:
1. Save the parsed data to a file and always read from that one: serialize a hash with two keys, 'last-modified' and 'data'. The 'last-modified' value is a date; on every request, check whether that date is today. If it is not, the new XML file is parsed and stored with today's date. The 'data' value is the parsed content. That way you parse just once, as a sort of cache (see the sketch below).
2. Save the parsed data to a NoSQL database, for example redis.
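A rough sketch of option 1, with one caveat: a Nokogiri document itself cannot be Marshal-dumped, so what gets serialized here is plain extracted data (the XPath follows the pseudocode above; the cache location is made up):
require 'nokogiri'
require 'date'

CACHE_PATH = '/tmp/projects.cache'  # hypothetical location

def project_names(xml_path)
  if File.exist?(CACHE_PATH)
    cache = Marshal.load(File.binread(CACHE_PATH))
    return cache[:data] if cache[:last_modified] == Date.today
  end
  # parse once and extract plain data; Nokogiri documents are not marshallable
  doc = Nokogiri::XML(File.read(xml_path))
  data = doc.xpath('//projects/project').each_with_object({}) do |project, h|
    h[project['id']] = project.at_xpath('name').text
  end
  File.binwrite(CACHE_PATH, Marshal.dump(last_modified: Date.today, data: data))
  data
end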
