Blocking findAndModify in Ruby MongoDB Driver

I'm trying to achieve something like this in MongoDB:
require 'base64'
require 'mongo'

class MongoDBQueue
  def enq(thing)
    collection.insert({ payload: Base64.encode64(Marshal.dump(thing)) })
  end
  alias :<< :enq

  def deq
    until _r = collection.find_and_modify({ sort: { _id: Mongo::ASCENDING }, remove: true })
      Thread.pass
    end
    return Marshal.load(Base64.decode64(_r["payload"]))
  end
  alias :pop :deq

  private

  def collection
    # database, collection & mongodb index semantics here
  end
end
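For context, the intended usage mirrors Ruby's Queue API; a hypothetical call site might look like:

queue = MongoDBQueue.new
queue << { url: 'http://example.com/' } # enqueue, via the :<< alias
page = queue.pop                        # dequeue, via the :pop alias (blocks by polling)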
Naturally enough I want a disk-backed queue in Ruby that doesn't destroy my available memory. I'm using this with the Anemone web spider framework, which by default uses the Queue class. There's a fork which can use the SizedQueue class, but when using a SizedQueue for both the "page queue" and the "links queue" it often deadlocks, presumably because a thread is trying to dequeue a page and process it, has found new links it cannot enqueue, and that situation cannot be reconciled.
There's also an existing implementation of a Redis queue, but that also exhausts all the available memory on this machine (available memory is 16 GB, so it's not trivial).
Because of that I want to use this MongoDB backend, but I think the implementation is insane. The Thread.pass feels like a horrible solution, but Anemone is multi-threaded, and MongoDB doesn't support blocking reads, so it's a tricky situation.
Here's my references:
Redis queue implementation for anemone: https://github.com/chriskite/anemone/blob/queueadapter/lib/anemone/queue/redis.rb
MongoDB findAndModify: http://www.mongodb.org/display/DOCS/findAndModify+Command
Questions:
Can anyone comment on how sane this is compared to sleep (which should trigger the VM to pass control to the next thread anyway, but sleep feels dirtier)?
Should I perhaps Thread.pass and sleep? (I guess not, see above.)
Can I make that read from MongoDB block? There was talk of that here, but it never came to anything: https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/rqnHNFXaZ0w

1) Reads in MongoDB are blocking. If you do a findOne() or a findAndModify(), the call will not return until the data is present on the client side. If you do a find(), the call will not return until you get a cursor; you can then iterate the cursor as much as you need.
2) By default, writes to MongoDB are "fire and forget". If you care about data integrity, you need to enable safe writes by setting :safe => true on your connection, database, or collection object.
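For illustration, a minimal sketch of both styles, assuming the legacy 1.x Ruby driver the question is using (later driver versions replaced :safe with write concerns):

require 'mongo'

# Enable acknowledged ("safe") writes for every operation on this connection...
connection = Mongo::Connection.new('localhost', 27017, :safe => true)
collection = connection.db('spider')['queue']

# ...or request acknowledgement for a single write only.
collection.insert({ payload: 'data' }, :safe => true)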

Kernel.sleep is actually a better solution, as otherwise you'll just spin there (albeit passing control to other threads after each query).
As findAndModify is atomic, only one thread (even on JRuby) will take the job, so I don't quite understand what the "blocking" issue is here.
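A hedged sketch of that suggestion, replacing the busy Thread.pass loop in deq with a short sleep between polls (the 0.1-second interval is an arbitrary choice):

def deq
  loop do
    doc = collection.find_and_modify({ sort: { _id: Mongo::ASCENDING }, remove: true })
    return Marshal.load(Base64.decode64(doc["payload"])) if doc
    sleep 0.1 # yield the CPU instead of hammering MongoDB with queries
  end
end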

Related

Multi-threading with Sequel gem

I have a program that is storing JSON request data into a Postgres DB using the Sequel gem (it's basically a price aggregator). However, the data is being pulled from multiple locations rapidly using threads.
Once I get the appropriate data, I currently have a mutex.synchronize with the following:
dbItem = Item.where(:sku => sku).first
dbItem = Item.create(:sku => sku, :name => itemName) if dbItem == nil
dbPrice = Price.create(:price => foundPrice, :quantity => quantity)
dbItem.add_price(dbPrice)
store.add_price(dbPrice)
If I run this without a mutex, I get threading issues - for example, the code will try to create an item in the DB, but the item will have just been created by another thread.
However, with the mutex, this code gets slowed down significantly - I'm seeing my program take ~four-six times longer.
I'm new to the whole database thing honestly, so I'm just trying to figure out the best way to handle threading. Am I doing this wrong? The documentation for Sequel actually states that almost everything is threadsafe... except for model instances, which I believe my item situation falls under. It states I should freeze models first, but I don't know how to apply that here.
You aren't dealing with shared model instances, and this isn't a thread-safety issue; it's a race condition (it would happen with multiple processes, not just multiple threads). You need to use database-specific support to handle this in a race-condition-free way. On PostgreSQL 9.5+, you would use Dataset#insert_conflict.
Sequel documentation: http://sequel.jeremyevans.net/rdoc-adapters/classes/Sequel/Postgres/DatasetMethods.html#method-i-insert_conflict
PostgreSQL documentation: https://www.postgresql.org/docs/9.5/static/sql-insert.html
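For illustration, a minimal sketch of the insert_conflict approach, assuming a unique index on items.sku (model and column names follow the question):

# Insert the item only if no row with that sku exists yet; on conflict
# the INSERT becomes a no-op instead of raising a uniqueness error.
Item.dataset.insert_conflict(target: :sku).insert(sku: sku, name: item_name)

# Either way a row with that sku now exists, so fetch it without a mutex.
db_item = Item.first(sku: sku)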

How to memoize MySQL connection client cleverly in external module used from e.g. sinatra?

I think the question does not pinpoint the real problem; I have difficulties nailing it down precisely and concisely.
I have a gem that implements e.g. MySQL database "queries" (also inserts, updates...):
module DBGEM::Query
  def self.client settings=DBGEM.settings
    @@client ||= Mysql2::Client.new settings
  end

  def query_this
    client.query(...)
  end

  def process_insert_that list_of_things
    list_of_things.each do |thing|
      # process
      client.query(...)
    end
  end
end
Furthermore, this gem is used by a Sinatra app sitting on a forking webserver like Puma.
Within the Sinatra app I can now:
get '/path' do
  happy = DBGEM::Query.query_this
  # process happy
  great = DBGEM::Query.process_insert_that 1..20
  # go on
end
I like that API and this code should open only one database connection.
But as far as I understand, because the code within the 'get' definition is not guaranteed to be the only code accessing the DBGEM::Query stuff at that time, weird things could happen (race conditions, shared internal state?).
Is there a clever way to keep the nice syntax and the connection sharing, without boilerplate object creation (query = DBGEM::Query.new #...) or wrapping the stuff in a block (DBGEM::Query.process do |query| #...)?
The example above is obviously simplified. The Sinatra handling might be more involved, the queries actually done in a service object, etc. Also, as far as I understand, in a forking webserver environment the GC would destroy the client (closing the connection; that's how mysql2 is implemented).
I think that the connection will not be closed every time.
@@client is shared between the DBGEM::Query object itself (in Ruby, modules and classes are also objects) and all the instances of that object (to be precise: all the instances of the classes into which that object is mixed).
So this variable will live as long as the DBGEM::Query object lives.
You can check when the DBGEM::Query object gets garbage collected by defining a finalizer that logs some text, and observing the server console:
module DBGEM::Query
  ObjectSpace.define_finalizer(self, proc { print 'garbage collected' })
  ..
end
I'm not sure, but I guess that the DBGEM::Query object will be garbage collected only when you stop the server.
As for the weird "things could happen": I believe you mean potential conflicts, race conditions, situations where you create duplicate records or update the same record at nearly the same time, overwriting something, etc. When that happens, you lose data integrity.
IMHO you can't prevent that by allowing only one client instance. I'd suggest aiming for solid database design (unique constraints, indexes, foreign keys, validations) which raises errors when a race condition occurs, and then handling those errors in your application.
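A sketch of that error-handling approach with mysql2, assuming a UNIQUE index on the column involved (the table and column here are made up for illustration):

begin
  client.query("INSERT INTO things (sku) VALUES ('abc123')")
rescue Mysql2::Error => e
  raise unless e.error_number == 1062 # 1062 is MySQL's duplicate-key error
  # The row already existed: another request won the race, which is fine.
end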

How does one achieve parallel tasks with Ruby's Fibers?

I'm new to fibers and EventMachine, and have only recently found out about fibers when I was seeing if Ruby had any concurrency features, like go-lang.
There don't seem to be a whole lot of examples out there of real use cases where you'd use a Fiber.
I did manage to find this: https://www.igvita.com/2009/05/13/fibers-cooperative-scheduling-in-ruby/ (back from 2009!!!)
which has the following code:
require 'eventmachine'
require 'em-http'
require 'fiber'

def async_fetch(url)
  f = Fiber.current
  http = EventMachine::HttpRequest.new(url).get :timeout => 10
  http.callback { f.resume(http) }
  http.errback { f.resume(http) }
  return Fiber.yield
end

EventMachine.run do
  Fiber.new {
    puts "Setting up HTTP request #1"
    data = async_fetch('http://www.google.com/')
    puts "Fetched page #1: #{data.response_header.status}"
    EventMachine.stop
  }.resume
end
And that's great, an async GET request, yay!!! But... how do I actually use it asynchronously? The example doesn't have anything beyond creating the containing Fiber.
From what I understand (and don't understand):
async_fetch is blocking until f.resume is called.
f is the current Fiber, which is the wrapping Fiber created in the EventMachine.run block.
the async_fetch yields control flow back to its caller? I'm not sure what this does
why does the wrapping fiber have resume at the end? are fibers paused by default?
Outside of the example, how do I use fibers to, say, shoot off a bunch of requests triggered by keyboard commands?
For example: every time I type a letter, I make a request to Google or something? Normally this would require threads, with the main thread launching a new thread for each request. :-\
I'm new to concurrency / Fibers. But they are so intriguing!
If anyone can answer these questions, that would be very appreciated!!!
There is a lot of confusion regarding fibers in Ruby. Fibers are not a tool with which to implement concurrency; they are merely a way of organizing code in a way that may more clearly represent what is going on.
That the name 'fibers' is similar to 'threads' in my opinion contributes to the confusion.
If you want true concurrency, that is, distributing the CPU load across all available CPUs, you have the following options:
In MRI Ruby
Running multiple Ruby VMs (i.e. OS processes), using fork, etc. Even with multiple threads in Ruby, the GIL (Global Interpreter Lock) prevents the Ruby runtime from using more than one CPU.
In JRuby
Unlike MRI Ruby, JRuby will use multiple CPUs when scheduling threads, so you can get truly concurrent processing.
If your code is spending most of its time waiting for external resources, then you may not have any need for this true concurrency. MRI threads or some kind of event handling loop will probably work fine for you.
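To make the asynchrony visible, here's a hedged sketch that launches several fibers before any response arrives, so the HTTP requests overlap; it reuses async_fetch from the question's example:

EventMachine.run do
  urls = ['http://www.google.com/', 'http://www.example.com/']
  pending = urls.size
  urls.each do |url|
    # Each fiber suspends inside async_fetch at Fiber.yield and is resumed
    # by its own HTTP callback, so all the requests are in flight together.
    Fiber.new do
      data = async_fetch(url)
      puts "Fetched #{url}: #{data.response_header.status}"
      pending -= 1
      EventMachine.stop if pending.zero?
    end.resume
  end
end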

Passing success and failure handlers to an ActiveJob

I have an ActiveJob that's supposed to load a piece of data from an external system over HTTP. When that job completes, I want to queue a second job that does some postprocessing and then submits the data to a different external system.
I don't want the first job to know about the second job, because
encapsulation
reusability
it's none of the first job's business, basically
Likewise, I don't want the first job to care what happens next if the data-loading fails -- maybe the user gets notified, maybe we retry after a timeout, maybe we just log it and throw up our hands -- again it could vary based on the details of the exception, and there's no need for the job to include the logic for that or the connections to other systems to handle it.
In Java (which is where I have the most experience), I could use something like Guava's ListenableFuture to add success and failure callbacks after the fact:
MyDataLoader loader = new MyDataLoader(someDataSource);
ListenableFuture<Data> future = executor.submit(loader);
Futures.addCallback(future, new FutureCallback<Data>() {
    public void onSuccess(Data result) {
        processData(result);
    }
    public void onFailure(Throwable t) {
        handleFailure(t);
    }
});
ActiveJob, though, doesn't seem to provide this sort of external callback mechanism -- as best I can make out from the relevant sections of "Active Job Basics", after_perform and rescue_from are only meant to be called from within the job class. And after_perform isn't meant to distinguish between success and failure.
So the best I've been able to come up with (and I'm not claiming it's very good) is to pass a couple of lambdas into the job's perform method, thus:
class MyRecordLoader < ActiveJob::Base
  # Loads data expensively (hopefully on a background queue) and passes
  # the result, or any exception, to the appropriate specified lambda.
  #
  # @param data_source [String] the URL to load data from
  # @param on_success [-> (String)] A lambda that will be passed the record
  #   data, if it's loaded successfully
  # @param on_failure [-> (Exception)] A lambda that will be passed any
  #   exception, if there is one
  def perform(data_source, on_success, on_failure)
    begin
      result = load_data_expensively_from data_source
      on_success.call(result)
    rescue => exception
      on_failure.call(exception)
    end
  end
end
(Side note: I have no idea what the yardoc syntax is for declaring lambdas as parameters. Does this look correct, or, failing that, plausible?)
The caller would then have to pass these in:
MyRecordLoader.perform_later(
  some_data_source,
  method(:process_data),
  method(:handle_failure)
)
That's not terrible, at least on the calling side, but it seems clunky, and I can't help but suspect there's a common pattern for this that I'm just not finding. And I'm somewhat concerned that, as a Ruby/Rails novice, I'm just bending ActiveJob to do something it was never meant to do in the first place. All the ActiveJob examples I'm finding are 'fire and forget' -- asynchronously "returning" a result doesn't seem to be an ActiveJob use case.
Also, it's not clear to me that this will work at all in the case of a back-end like Resque that runs the jobs in a separate process.
What's "the Ruby way" to do this?
Update: As hinted at by dre-hh, ActiveJob turned out not to be the right tool here. It was also unreliable, and overcomplicated for the situation. I switched to Concurrent Ruby instead, which fits the use case better, and which, since the tasks are mostly IO-bound, is fast enough even on MRI, despite the GIL.
ActiveJob is not an async library like a future or promise.
It is just an interface for performing tasks in the background. The current thread/process receives no result of the operation.
For example, when using Sidekiq as the ActiveJob queue, it will serialize the parameters of the perform method into the Redis store. Another daemon process, running within the context of your Rails app, will be watching the Redis queue and will instantiate your worker with the deserialized data.
So passing callbacks might be alright, but why have them as methods on another class? Passing callbacks would make sense if they were dynamic (changing on different invocations). However, as you have them implemented on the calling class, consider just moving those methods into your job worker class.
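A minimal sketch of that suggestion, folding the question's process_data and handle_failure into the job itself (the method bodies are placeholders):

class MyRecordLoader < ActiveJob::Base
  def perform(data_source)
    result = load_data_expensively_from(data_source)
    process_data(result)
  rescue => exception
    handle_failure(exception)
  end

  private

  def process_data(result)
    # postprocess and submit to the second external system
  end

  def handle_failure(exception)
    # notify the user, schedule a retry, or log and give up
  end
end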

Does this code require a Mutex to access the #clients variable thread safely?

For fun I wrote this Ruby socket server, which actually works quite nicely. I'm planning on using it for the backend of an iOS app. My question for now is: when in the thread do I need a Mutex? Will I need one when accessing a shared variable such as @clients?
require 'rubygems'
require 'socket'

module Server
  @server = Object.new
  @clients = []
  @sessions

  def self.run(port=3000)
    @server = TCPServer.new port
    while (socket = @server.accept)
      @clients << socket
      Thread.start(socket) do |socket|
        begin
          loop do
            begin
              msg = String.new
              while (data = socket.read_nonblock(1024))
                msg << data
                break if data.to_s.length < 1024
              end
              @clients.each do |client|
                client.write "#{socket} says: #{msg}" unless client == socket
              end
            rescue
            end
          end
        rescue => e
          @clients.delete socket
          puts e
          puts "Killed client #{socket}"
          Thread.kill self
        end
      end
    end
  end
end

Server.run
--Edit--
According to the answer from John Bollinger, I need to synchronize any time a thread accesses a shared resource. Does this apply to database queries? Can I read from and write to a Postgres database with the ActiveRecord ORM from multiple threads at once?
Any data that may be modified by one thread and read by a different one must be protected by a Mutex or a similar synchronization construct. Inasmuch as multiple threads may safely read the same data at the same time, a synchronization construct a bit more sophisticated than a single Mutex might yield better performance.
In your code, it looks like not only does @clients need to be properly synchronized, but so also do all its elements, because writing to a socket is a modification.
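For illustration, a hedged sketch of patching the question's code with a single Mutex (a pattern sketch, not a complete server; the names follow the question):

@clients = []
@lock = Mutex.new

# Wherever @clients is modified:
@lock.synchronize { @clients << socket }
@lock.synchronize { @clients.delete socket }

# And wherever it is iterated, so no thread mutates it mid-broadcast:
@lock.synchronize do
  @clients.each do |client|
    client.write "#{socket} says: #{msg}" unless client == socket
  end
end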
Don't use a mutex unless you really have to.
It's a pity the literature on Ruby multi-threading is so scarce; the only good book written on the topic is Working With Ruby Threads by Jesse Storimer. I've learned a lot of useful principles from it, one of which is: don't use a mutex if there are better alternatives. In your case, there are. If you use Ruby without any gems, the only thread-safe data structure is a Queue. An array is not safe. However, with the thread_safe gem you can create one:
require 'thread_safe'
sa = ThreadSafe::Array.new # supports standard Array.new forms
sh = ThreadSafe::Hash.new # supports standard Hash.new forms
Regarding your question: you only need to protect a shared data structure with a mutex if some thread MODIFIES it. If all the threads just read from it, you don't need one (see John's comment for a case where you might still need a mutex when one thread is reading while another is writing). You don't need one for accessing unchanging data. If you're using ActiveRecord + Postgres: yes, ActiveRecord IS thread-safe; as for Postgres, you might want to follow these instructions (Behavior in Threaded Programs) to check that.
Also, be aware of race conditions (see How to Make ActiveRecord ThreadSafe, which describes one inherent problem you should be aware of when coding multi-threaded apps).
Avdi Grimm had one piece of very sound advice for multi-threaded apps: when testing them, make them fail loud and fast. So don't forget to add at the top:
Thread.abort_on_exception = true
so your threads don't silently fail if something wrong happens.
