How to memoize MySQL connection client cleverly in external module used from e.g. sinatra? - ruby

I don't think this question pinpoints the real problem; I'm having difficulty nailing it down precisely and concisely.
I have a gem that implements e.g. MySQL database "queries" (also inserts, updates, ...):
module DBGEM::Query
  def self.client(settings = DBGEM.settings)
    # memoize the connection in a class variable shared by the module
    @@client ||= Mysql2::Client.new(settings)
  end

  def query_this
    client.query(...)
  end

  def process_insert_that(list_of_things)
    list_of_things.each do |thing|
      # process
      client.query(...)
    end
  end
end
Furthermore, this gem is used by a Sinatra app sitting on a forking web server like Puma.
Within the Sinatra app I can now write:
get '/path' do
happy = DBGEM::Query.query_this
# process happy
great = DBGEM::Query.process_insert_that 1..20
# go on
end
I like that API and this code should open only one database connection.
But as far as I understand, the code within the 'get' definition is not guaranteed to be the only code accessing DBGEM::Query at that time, so weird things could happen (race conditions, shared internal state?).
Is there a clever way to keep the nice syntax and the connection sharing without boilerplate object creation (query = DBGEM::Query.new() #...) or wrapping everything in a block (DBGEM::Query.process do |query| #...)?
The example above is obviously simplified. The Sinatra handling might be more involved, the queries might actually be done in a service object, and so on. Also, as far as I understand, in a forking web server environment the GC would destroy the client (closing the connection; that's how mysql2 is implemented).

I think that the connection will not be closed every time.
@@client is shared between the DBGEM::Query object itself (in Ruby, modules and classes are also objects) and all instances of that object (to be precise: all instances of classes into which that module is mixed).
So this variable will live as long as the DBGEM::Query object lives.
You can check when the DBGEM::Query object is garbage collected by defining a finalizer that logs some text and watching the server console.
module DBGEM::Query
ObjectSpace.define_finalizer(self, proc { print 'garbage collected' })
..
end
I'm not sure, but I guess that the DBGEM::Query object will be garbage collected only when you stop the server.
As for "weird things could happen", I believe you mean potential conflicts, race conditions, situations where you create duplicate records or update the same record at nearly the same time and overwrite something, etc. When that happens, you lose data integrity.
IMHO you can't prevent that by allowing only one client instance. I'd suggest aiming for solid database design (unique constraints, indexes, foreign keys, validations) that raises errors when a race condition occurs, and then handling those errors in your application.
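If the worry is several threads sharing one connection, one option that keeps the module-level call syntax is to memoize the client per thread instead of per module. The following is only a minimal sketch under that assumption, not the gem's actual code: FakeClient stands in for Mysql2::Client so the example runs without a database, and ThreadLocalClient and the :db_client key are hypothetical names.

```ruby
# Stand-in for Mysql2::Client (assumption) so the sketch runs without a database.
class FakeClient
  def initialize(settings)
    @settings = settings
  end
end

module ThreadLocalClient
  # Returns one client per thread; each thread builds its own on first use.
  def self.client(settings = {})
    current = Thread.current
    client = current.thread_variable_get(:db_client)
    unless client
      client = FakeClient.new(settings)
      current.thread_variable_set(:db_client, client)
    end
    client
  end
end

# Same thread: the memoized client is reused.
same_thread = ThreadLocalClient.client.equal?(ThreadLocalClient.client)
# Another thread: it gets a client of its own.
other_client = Thread.new { ThreadLocalClient.client }.value
```

With this approach the number of open connections equals the number of threads that touch the database, so it trades the "only one connection" goal for not having to share one.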

Related

Run when you can

In my sinatra web application, I have a route:
get "/" do
temp = MyClass.new("hello",1)
redirect "/home"
end
Where MyClass is:
class MyClass
  @@instancesArray = []
  attr_reader :string

  def initialize(string, id)
    @string = string
    @id = id
    @@instancesArray[id] = self
  end

  def self.run(id)
    puts @@instancesArray[id].string
  end
end
At some point I would want to run MyClass.run(1), but I wouldn't want it to execute immediately because that would slow down the server's response to some clients. I would want the server to wait to run MyClass.run(temp) until the load is lighter. How could I tell it to wait until there is an empty/light load, and then run MyClass.run(temp)? Can I do that?
Addendum
Here is some sample code for what I would want to do:
$var = 0
get "/" do
$var = $var+1 # each time a request is received, it increments
end
After that I would have a loop that counts requests per minute (so after a minute it would reset $var to 0), and if $var was less than some number, it would run tasks until the load increased.
As Andrew mentioned (correctly—not sure why he was voted down), Sinatra stops processing a route when it sees a redirect, so any subsequent statements will never execute. As you stated, you don't want to put those statements before the redirect because that will block the request until they complete. You could potentially send the redirect status and header to the client without using the redirect method and then call MyClass#run. This will have the desired effect (from the client's perspective), but the server process (or thread) will block until it completes. This is undesirable because that process (or thread) will not be able to serve any new requests until it unblocks.
You could fork a new process (or spawn a new thread) to handle this background task asynchronously from the main process associated with the request. Unfortunately, this approach has the potential to get messy. You would have to code around different situations like the background task failing, or the fork/spawn failing, or the main request process not ending if it owns a running thread or other process. (Disclaimer: I don't really know enough about IPC in Ruby and Rack under different application servers to understand all of the different scenarios, but I'm confident that here there be dragons.)
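The thread variant of that idea could look roughly like the sketch below. run_in_background is a hypothetical helper, not something Sinatra provides, and it only covers the "background task failing" case by logging the error; it does nothing about the server process exiting while the thread is still running.

```ruby
# Runs the given block in a new thread and returns the Thread object.
# Errors are logged to stderr instead of silently killing the thread.
def run_in_background(&task)
  Thread.new do
    begin
      task.call
    rescue StandardError => e
      warn "background task failed: #{e.class}: #{e.message}"
      nil
    end
  end
end

ok = run_in_background { 21 * 2 }
failed = run_in_background { raise ArgumentError, "boom" }

# Thread#value waits for the thread and returns the block's result.
ok.value     # 42
failed.value # nil, and the error was written to stderr
```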
The most common solution pattern for this type of problem is to push the task into some kind of work queue to be serviced later by another process. Pushing a task onto the queue is ideally a very quick operation, and won't block the main process for more than a few milliseconds. This introduces a few new challenges (where is the queue? how is the task described so that it can be facilitated at a later time without any context? how do we maintain the worker processes?) but fortunately a lot of the leg work has already been done by other people. :-)
There is the delayed_job gem, which seems to provide a nice all-in-one solution. Unfortunately, it's mostly geared towards Rails and ActiveRecord, and the efforts people have made in the past to make it work with Sinatra look to be unmaintained. The contemporary, framework-agnostic solutions are Resque and Sidekiq. It might take some effort to get up and running with either option, but it would be well worth it if you have several "run when you can" type functions in your application.
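To make the queue pattern concrete, here is a minimal in-process sketch using Ruby's built-in Queue and a single worker thread. It is only an illustration of the idea: TinyWorkQueue is a hypothetical name, and Resque and Sidekiq keep the queue in Redis and run the workers in separate processes rather than in a thread like this.

```ruby
require 'thread'

# In-process stand-in for a job queue (assumption: one worker thread).
class TinyWorkQueue
  STOP = Object.new

  attr_reader :results

  def initialize
    @jobs = Queue.new
    @results = []
    @worker = Thread.new { work }
  end

  # Enqueueing only pushes onto the queue, so the caller returns right away.
  def enqueue(&job)
    @jobs << job
  end

  # Lets the worker finish everything already queued, then stops it.
  def shutdown
    @jobs << STOP
    @worker.join
  end

  private

  def work
    loop do
      job = @jobs.pop # blocks until a job is available
      break if job.equal?(STOP)
      @results << job.call
    end
  end
end

queue = TinyWorkQueue.new
queue.enqueue { "hello".upcase }
queue.enqueue { 1 + 1 }
queue.shutdown
```

In a route, the call would be something like queue.enqueue { MyClass.run(1) } before the redirect, so the request itself only pays for the push.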
MyClass.run(temp) is never actually executing. In your current request to / path you instantiate a new instance of MyClass then it will immediately do a get request to /home. I'm not entirely sure what the question is though. If you want something to execute after the redirect, that functionality needs to exist within the /home route.
get '/home' do
# some code like MyClass.run(some_arg)
end

Passing success and failure handlers to an ActiveJob

I have an ActiveJob that's supposed to load a piece of data from an external system over HTTP. When that job completes, I want to queue a second job that does some postprocessing and then submits the data to a different external system.
I don't want the first job to know about the second job, because
encapsulation
reusability
it's none of the first job's business, basically
Likewise, I don't want the first job to care what happens next if the data-loading fails -- maybe the user gets notified, maybe we retry after a timeout, maybe we just log it and throw up our hands -- again it could vary based on the details of the exception, and there's no need for the job to include the logic for that or the connections to other systems to handle it.
In Java (which is where I have the most experience), I could use something like Guava's ListenableFuture to add success and failure callbacks after the fact:
MyDataLoader loader = new MyDataLoader(someDataSource);
ListenableFuture<Data> future = executor.submit(loader);
Futures.addCallback(future, new FutureCallback<Data>() {
public void onSuccess(Data result) {
processData(result);
}
public void onFailure(Throwable t) {
handleFailure(t);
}
});
ActiveJob, though, doesn't seem to provide this sort of external callback mechanism -- as best I can make out from relevant sections in "Active Job Basics", after_perform and rescue_from are only meant to be called from within the job class. And after_perform isn't meant to distinguish between success and failure.
So the best I've been able to come up with (and I'm not claiming it's very good) is to pass a couple of lambdas into the job's perform method, thus:
class MyRecordLoader < ActiveJob::Base
  # Loads data expensively (hopefully on a background queue) and passes
  # the result, or any exception, to the appropriate specified lambda.
  #
  # @param data_source [String] the URL to load data from
  # @param on_success [-> (String)] A lambda that will be passed the record
  #   data, if it's loaded successfully
  # @param on_failure [-> (Exception)] A lambda that will be passed any
  #   exception, if there is one
  def perform(data_source, on_success, on_failure)
    begin
      result = load_data_expensively_from data_source
      on_success.call(result)
    rescue => exception
      on_failure.call(exception)
    end
  end
end
(Side note: I have no idea what the yardoc syntax is for declaring lambdas as parameters. Does this look correct, or, failing that, plausible?)
The caller would then have to pass these in:
MyRecordLoader.perform_later(
some_data_source,
method(:process_data),
method(:handle_failure)
)
That's not terrible, at least on the calling side, but it seems clunky, and I can't help but suspect there's a common pattern for this that I'm just not finding. And I'm somewhat concerned that, as a Ruby/Rails novice, I'm just bending ActiveJob to do something it was never meant to do in the first place. All the ActiveJob examples I'm finding are 'fire and forget' -- asynchronously "returning" a result doesn't seem to be an ActiveJob use case.
Also, it's not clear to me that this will work at all in the case of a back-end like Resque that runs the jobs in a separate process.
What's "the Ruby way" to do this?
Update: As hinted at by dre-hh, ActiveJob turned out not to be the right tool here. It was also unreliable, and overcomplicated for the situation. I switched to Concurrent Ruby instead, which fits the use case better, and which, since the tasks are mostly IO-bound, is fast enough even on MRI, despite the GIL.
ActiveJob is not an async library like a future or promise.
It is just an interface for performing tasks in the background. The current thread/process receives no result from this operation.
For example when using Sidekiq as ActiveJob queue, it will serialize the parameters of the perform method into the redis store. Another daemon process running within the context of your rails app will be watching the redis queue and instantiate your worker with the serialized data.
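That serialization step is why lambdas or Method objects are awkward as job arguments. The sketch below uses Marshal purely as a stand-in for whatever serializer the queue backend uses (ActiveJob has its own argument serialization, which this does not reproduce); serializable? is a hypothetical helper.

```ruby
# Marshal is only a stand-in here for the queue backend's serializer.
def serializable?(object)
  Marshal.dump(object)
  true
rescue TypeError
  false
end

plain_args = serializable?(["http://example.com/data", 42]) # true
lambda_arg = serializable?(->(result) { result })            # false
method_arg = serializable?(1.method(:+))                     # false
```

Strings, numbers, and arrays of them round-trip fine; code objects do not, so a worker in another process has nothing to call.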
So passing callbacks might be all right, but why have them as methods on another class? Passing callbacks would make sense if they are dynamic (changing between invocations). However, since you have them implemented on the calling class, consider just moving those methods into your job worker class.
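A rough sketch of that suggestion, written as a plain Ruby class so it runs without Rails (in a real app it would presumably subclass ActiveJob::Base). RecordLoaderJob, load_data, and the outcome reader are hypothetical names used only for illustration.

```ruby
# Plain-Ruby sketch: the success/failure handling lives on the job itself.
class RecordLoaderJob
  attr_reader :outcome

  def perform(data_source)
    on_success(load_data(data_source))
  rescue StandardError => e
    on_failure(e)
  end

  private

  # Hypothetical loader: stands in for the expensive HTTP call.
  def load_data(data_source)
    raise ArgumentError, "no data source given" if data_source.nil?
    "data from #{data_source}"
  end

  def on_success(result)
    @outcome = [:ok, result]
  end

  def on_failure(error)
    @outcome = [:failed, error.message]
  end
end

good = RecordLoaderJob.new
good.perform("http://example.com/records")
bad = RecordLoaderJob.new
bad.perform(nil)
```

The only argument left is the data source string, which any backend can serialize.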

Mutex for ActiveRecord Model

My User model has a nasty method that should not be called simultaneously for two instances of the same record. I need to execute two http requests in a row and at the same time make sure that any other thread does not execute the same method for the same record at the same time.
class User
  ...
  def nasty_long_running_method
    # something nasty will happen if this method is called simultaneously
    # for two instances of the same record and the later one finishes http_request_1
    # before the first one finishes http_request_2.
    http_request_1 # Takes 1-3 seconds.
    http_request_2 # Takes 1-3 seconds.
    update_model
  end
end
For example this would break everything:
user = User.first
Thread.new { user.nasty_long_running_method }
Thread.new { user.nasty_long_running_method }
But this would be ok and it should be allowed:
user1 = User.find(1)
user2 = User.find(2)
Thread.new { user1.nasty_long_running_method }
Thread.new { user2.nasty_long_running_method }
What would be the best way to make sure the method is not called simultaneously for two instances of the same record?
I found a gem, Remote lock, when searching for a solution to my problem. It is a mutex solution that uses Redis as the backend.
It:
is accessible for all processes
does not lock the database
is in memory -> fast and no IO
The method looks like this now
def nasty
$lock = RemoteLock.new(RemoteLock::Adapters::Redis.new(REDIS))
$lock.synchronize("capi_lock_#{user_id}") do
http_request_1
http_request_2
update_user
end
end
I would start with adding a mutex or semaphore. Read about mutex: http://www.ruby-doc.org/core-2.1.2/Mutex.html
class User
  ...
  def nasty
    @semaphore ||= Mutex.new
    @semaphore.synchronize {
      # only one thread at a time can enter this block...
    }
  end
end
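Note that a mutex stored on the instance only guards that one Ruby object, so two objects loaded for the same record would each get their own mutex. A sketch of one way around that within a single process is a registry of mutexes keyed by record id. RecordLocks is a hypothetical class, and this still does nothing across multiple server processes.

```ruby
# Hands out one Mutex per id, so callers with the same id exclude each other
# while callers with different ids can run concurrently.
class RecordLocks
  def initialize
    @guard = Mutex.new # protects the @locks hash itself
    @locks = {}
  end

  def synchronize(id, &block)
    lock = @guard.synchronize { @locks[id] ||= Mutex.new }
    lock.synchronize(&block)
  end
end

LOCKS = RecordLocks.new

# Two threads asking for the same id run their blocks one after the other.
order = []
threads = 2.times.map do |n|
  Thread.new { LOCKS.synchronize(1) { order << "start #{n}"; sleep 0.01; order << "end #{n}" } }
end
threads.each(&:join)
```

Inside the model, the call would be something like LOCKS.synchronize(id) { ... } around the two HTTP requests.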
If your class is an ActiveRecord object you might want to use Rails' locking and database transactions. See: http://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html
def nasty
User.transaction do
lock!
...
save!
end
end
Update: You updated your question with more details, and it seems my solutions do not really fit anymore. The first solution does not work if you have multiple instances running. The second locks only the database row; it does not prevent multiple threads from entering the code block at the same time.
Therefore I would think about building a database-based semaphore.
class Semaphore < ActiveRecord::Base
  belongs_to :item, :polymorphic => true

  # identifier is optional and currently unused
  def self.get_lock(item, identifier = nil)
    # may raise an invalid key exception from the unique key constraint in the db
    create(:item => item) rescue false
  end

  def release
    destroy
  end
end
The database should have a unique index covering the columns of the polymorphic association to item. That should prevent multiple threads from getting a lock for the same item at the same time. Your method would look like this:
def nasty
  semaphore = nil # must be assigned before the loop reads it
  until semaphore
    semaphore = Semaphore.get_lock(user)
  end
  ...
  semaphore.release
end
There are a couple of problems to solve around this: How long do you want to wait to get the semaphore? What happens if the external HTTP requests take ages? Do you need to store additional pieces of information (hostname, pid) to identify which thread locked an item? You will need some kind of cleanup task that removes locks that still exist after a certain period of time or after restarting the server.
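For the first of those questions, a bounded wait could be sketched as below. acquire_with_timeout and LockTimeout are hypothetical names, the default timeout and interval are arbitrary, and try_lock is any callable that returns a lock object or nil/false.

```ruby
class LockTimeout < StandardError; end

# Calls try_lock until it returns something truthy or `timeout` seconds
# have passed, sleeping `interval` seconds between attempts.
def acquire_with_timeout(try_lock, timeout: 5.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    lock = try_lock.call
    return lock if lock
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise LockTimeout, "could not get lock within #{timeout}s"
    end
    sleep interval
  end
end
```

With the Semaphore model above, the call would be along the lines of acquire_with_timeout(-> { Semaphore.get_lock(user) }), with the caller deciding what to do when LockTimeout is raised.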
Furthermore, I think it is a terrible idea to have something like this in a web server. At the least, you should move all that stuff into background jobs. That might solve your problem if your app is small and needs just one background job to get everything done.
You state that this is an ActiveRecord model, in which case the usual approach would be to use a database lock on that record. No need for additional locking mechanisms as far as I can see.
Take a look at the short (one page) Rails Guides section on pessimistic locking - http://guides.rubyonrails.org/active_record_querying.html#pessimistic-locking
Basically you can get a lock on a single record or a whole table (if you were updating a lot of things)
In your case something like this should do the trick...
class User < ActiveRecord::Base
  ...
  def nasty_long_running_method
    with_lock do
      # something nasty will happen if this method is called simultaneously
      # for two instances of the same record and the later one finishes http_request_1
      # before the first one finishes http_request_2.
      http_request_1 # Takes 1-3 seconds.
      http_request_2 # Takes 1-3 seconds.
      update_model
    end
  end
end
I recently created a gem called szymanskis_mutex. It is a module that you can include in the class User and provides the method mutual_exclusion(concern) to provide the functionality you want.
It doesn't rely on databases and doesn't depend on how many processes want to enter the critical section at any given moment.
Note that it will not work if the class is initialized on different servers.
It may suit your needs if your app is small enough. Your code would look like this:
class User
  include SzymanskisMutex
  ...
  def nasty_long_running_method
    mutual_exclusion(:nasty_long) do
      http_request_1 # Takes 1-3 seconds.
      http_request_2 # Takes 1-3 seconds.
    end
    update_model
  end
end
I suggest rethinking your architecture, as this is not going to be scalable: imagine having multiple Ruby processes, failing processes, timeouts, etc. Also, in-process locking and spawning threads are quite dangerous in application servers.
If you want to sleep well with this in production, try an async background processing framework for long-running tasks, with a serial queue that ensures the order in which tasks run. Simple RabbitMQ would do, or check the Q&A "Best practice for Rails App to run a long task in the background?". Alternatively, try the database, but with optimistic locking.

Proper way to maintain many connections with Celluloid?

I am currently working on an application that pulls mail from many IMAP mailboxes. It seems like Celluloid is a good fit for this part, but I'm unsure how to employ actors.
The application will be run in a distributed fashion. There are x mailboxes to poll and y processes among which these will be divided. So each process has a list of mailboxes they have to poll and this list will change every now and then. This means the pool of connections maintained by each process is dynamic.
My biggest question is: should I spawn a separate ImapConnection actor for each mailbox, or should I make a single ImapListener actor that manages all connections internally?
My current design features the former solution. There's one central Coordinator actor that keeps an array of actors that each manage one imap connection. A new connection is added with a simple:
@connections << ImapConnection.supervise(account_info)
The ImapConnection either polls the IMAP server at regular intervals or maintains an IDLE connection. If the Coordinator wants to stop polling a mailbox, it looks it up in its @connections array and properly disposes of it.
This seems like a logical approach to me that yields many benefits of Celluloid (such as automatic restarting of crashed actors), but I'm struggling to find examples of other software that uses this approach. Is spawning hundreds of actors in this fashion proper use of the actor model, or should I use a different approach?
Very glad to hear you are using Celluloid. Good question.
Not sure how you create connections and maintain them, whether that be by a TCPSocket you have the ability to manage or not. If you have the ability to manage a TCPSocket directly, you ought to use Celluloid::IO as well as Celluloid itself. I also don't know where you put information pulled in from IMAP connections. These two things influence your strategy.
Your approach is not bad, but yes, it could possibly be improved by adding polling workers to do the heavy lifting, another actor that only holds account_info, and a final actor that triggers the work and/or maintains the IDLE state. So you'd end up with ImapWorker (a pool), ImapMaintainer, and ImapRegistry. At this point I wonder whether, since you are polling, you need to keep a connection open at all rather than letting information be pushed. If you plan to poll and still keep connections open, here is what the three actors would do:
ImapRegistry holds your account_info in a Hash. It would have methods like add, get, and remove. I recommend a Hash of @credentials so you can use the same ID in both ImapMaintainer and ImapRegistry; one holds live connections in its @connections, and the other holds account_info instances in its @credentials. Both @connections and @credentials are accessed by the same ID, but one keeps a volatile connection whereas the other only has static data usable to recreate a connection if necessary. In this way, your heavy lifters could die and be respawned, and the entire system could regenerate itself.
ImapMaintainer would hold the actual @connections, plus every( interval ) { } tasks that are added when account_info is stored in ImapRegistry. I see two tasks, depending on how frequently you plan to poll. One simply touches the IMAP connection to keep it alive, and the other polls the IMAP server with ImapWorker. ImapWorker would be a pool saved in ImapMaintainer as, say, @worker. So it has @connections, @worker, #polling, and #keepalive. Polling could be a @connections.each loop, or you could have one timer per connection, added at the point the connection is created.
ImapWorker has two methods. One is #touch, which keeps a connection alive. The main one is #poll, which takes a connection you maintain and runs a polling process on it. That method returns the information or, even better, also stores it; then the worker returns to the @worker pool. This gives you the benefit of having the polling happen in a separate thread rather than just a separate fiber, and it keeps the trickiest part in the most robust yet most unaware kind of actor.
Working backward: if ImapRegistry receives #add, it stores account_info and passes it to ImapMaintainer, which creates the connection and timer(s), or just creates the connection and lets one big timer maintain the connections with @worker, which is a pool. (ImapMaintainer forgets account_info once the connection exists.) ImapMaintainer inevitably hits a timer, so at the start and end of its timer it can check its connection. If the connection is gone for some reason, it can recreate it with information from @registry.get. Within its timer-prompted task, it can run @worker.poll or @worker.alive.
This illustrates the above requirements, showing how the initializers would put together the actor system, and has an incomplete skeleton of methods mentioned.
WORKERS = 9 #de arbitrarily chosen
class ImapRegistry
include Celluloid
def initialize
@maintainer = ImapMaintainer.supervise
@credentials = {}
end
def add( account_info )
...
end
def get( id )
...
end
def remove( id )
...
end
end
class ImapMaintainer
include Celluloid
def initialize
@worker = ImapWorker.pool size: WORKERS
@connections = {}
end
def add( id, credential )
...
end
def remove( id )
...
end
#de These exist if there is one big timer:
def polling
...
end
def keepalive
...
end
end
class ImapWorker
include Celluloid
def initialize
#de Nothing needed.
end
def poll( connection )
...
end
def touch( connection )
...
end
end
registry = ImapRegistry.supervise
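To illustrate only the "same ID in both places" idea behind the registry, here is a plain-Ruby sketch that does not use Celluloid at all. CredentialStore is a hypothetical name, the integer ids are an assumption, and the Mutex merely stands in for the one-message-at-a-time handling an actor would give you.

```ruby
# Plain-Ruby illustration of a registry keyed by id (not a Celluloid actor).
class CredentialStore
  def initialize
    @guard = Mutex.new
    @credentials = {}
    @next_id = 0
  end

  # Stores account_info and returns the id that a maintainer could reuse
  # as the key for the matching live connection.
  def add(account_info)
    @guard.synchronize do
      id = (@next_id += 1)
      @credentials[id] = account_info
      id
    end
  end

  def get(id)
    @guard.synchronize { @credentials[id] }
  end

  # Returns the removed account_info, or nil if the id was unknown.
  def remove(id)
    @guard.synchronize { @credentials.delete(id) }
  end
end

store = CredentialStore.new
id = store.add(host: "imap.example.com", user: "alice")
store.get(id)    # the stored hash, usable to rebuild a lost connection
store.remove(id)
```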
I love Celluloid and hope you have a lot of success with it. Please ask if you want anything clarified, but this at least is another strategy for you to consider.

Blocking findAndModify in Ruby MongoDB Driver

I'm trying to achieve something like this in MongoDB:
require 'base64'
require 'mongo'
class MongoDBQueue
def enq(thing)
collection.insert({ payload: Base64.encode64(Marshal.dump(thing))})
end
alias :<< :enq
def deq
until _r = collection.find_and_modify({ sort: {_id: Mongo::ASCENDING}, remove: true})
Thread.pass
end
return Marshal.load(Base64.decode64(_r["payload"]))
end
alias :pop :deq
private
def collection
# database, collection & mongodb index semantics here
end
end
Naturally, I want a disk-backed queue in Ruby that doesn't destroy my available memory. I'm using this with the Anemone web spider framework, which by default uses the Queue class. There's a fork which can use the SizedQueue class; however, when using a SizedQueue for both the "page queue" and "links queue", it often deadlocks, presumably because it's trying to dequeue a page and process it, it has found new links, and that situation cannot be reconciled.
There's also an existing implementation of a Redis queue; however, that also exhausts all my available memory on this machine (available memory is 16 GB, so it's not trivial).
Because of that I want to use this MongoDB backend, but I think the implementation is insane. The Thread.pass feels like a horrible solution, but Anemone is multi-threaded, and MongoDB doesn't support blocking reads, so it's a tricky situation.
Here's my references:
Redis queue implementation for anemone: https://github.com/chriskite/anemone/blob/queueadapter/lib/anemone/queue/redis.rb
MongoDB findAndModify: http://www.mongodb.org/display/DOCS/findAndModify+Command
Questions:
Can anyone comment on how sane this is compared to sleep (which should trigger the VM to pass control to the next thread anyway, but sleep feels dirtier)?
Should I perhaps Thread.pass and sleep? ( I guess not, see above)
Can I make that read from MongoDB block? There was talk of that here, but never came to anything: https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/rqnHNFXaZ0w
1) Reads in MongoDB are blocking. If you do a findOne() or a findAndModify(), the call will not return until the data is present in the client side. If you do a find(), the call will not return until you get a cursor: you can then iterate on the cursor as much as you need.
2) By default, writes to MongoDB are "fire and forget". If you care about data integrity, you need to do safe writes by setting :safe => true in your connection, database, or collection object.
Kernel.sleep is actually a better solution, as otherwise you'll spin there (albeit passing control to other threads after each query).
As the findAndModify is atomic, only one thread (even on JRuby) will take the job, so I don't quite understand what's the "blocking" issue here.
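A sketch of the sleep-based loop, kept independent of the MongoDB driver so it can run on its own. blocking_pop is a hypothetical helper, fetch is any callable that returns an item or nil, and the doubling pause with a cap is my own addition rather than something the answers above prescribe.

```ruby
# Polls `fetch` and sleeps between empty results, doubling the pause
# each time up to max_delay, until an item is returned.
def blocking_pop(fetch, delay: 0.01, max_delay: 1.0)
  loop do
    item = fetch.call
    return item unless item.nil?
    sleep delay
    delay = [delay * 2, max_delay].min
  end
end

# Simulated source: empty twice, then one item.
responses = [nil, nil, "page"]
item = blocking_pop(-> { responses.shift })
```

In the queue class from the question, fetch would wrap the find_and_modify call from deq, and the Marshal decoding would happen on the returned document.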

Resources