Mutex for ActiveRecord Model - ruby

My User model has a nasty method that should not be called simultaneously for two instances of the same record. I need to execute two http requests in a row and at the same time make sure that any other thread does not execute the same method for the same record at the same time.
class User
...
def nasty_long_running_method
// something nasty will happen if this method is called simultaneously
// for two instances of the same record and the later one finishes http_request_1
// before the first one finishes http_request_2.
http_request_1 // Takes 1-3 seconds.
http_request_2 // Takes 1-3 seconds.
update_model
end
end
For example this would break everything:
user = User.first
Thread.new { user.nasty_long_running_method }
Thread.new { user.nasty_long_running_method }
But this would be ok and it should be allowed:
user1 = User.find(1)
user2 = User.find(2)
Thread.new { user1.nasty_long_running_method }
Thread.new { user2.nasty_long_running_method }
What would be the best way to make sure the method is not called simultaneously for two instances of the same record?

I found a gem Remote lock when searching for a solution for my problem. It is a mutex solution that uses Redis in the backend.
It:
is accessible for all processes
does not lock the database
is in memory -> fast and no IO
The method looks like this now
def nasty
$lock = RemoteLock.new(RemoteLock::Adapters::Redis.new(REDIS))
$lock.synchronize("capi_lock_#{user_id}") do
http_request_1
http_request_2
update_user
end
end

I would start with adding a mutex or semaphore. Read about mutex: http://www.ruby-doc.org/core-2.1.2/Mutex.html
class User
...
def nasty
#semaphore ||= Mutex.new
#semaphore.synchronize {
# only one thread at a time can enter this block...
}
end
end
If your class is an ActiveRecord object you might want to use Rails' locking and database transactions. See: http://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html
def nasty
User.transaction do
lock!
...
save!
end
end
Update: You updated your question with more details. And it seems like my solutions do not really fit anymore. The first solutions does not work if you have multiple instances running. The second locks only the database row, it does not prevent multiple thread from entering the code block at the same time.
Therefore if would think about building a database based semaphore.
class Semaphore < ActiveRecord::Base
belongs_to :item, :polymorphic => true
def self.get_lock(item, identifier)
# may raise invalid key exception from unique key contraints in db
create(:item => item) rescue false
end
def release
destroy
end
end
The database should have an unique index covering the rows for the polymorphic association to item. That should protect multiple thread from getting a lock for the same item at the same time. Your method would look like this:
def nasty
until semaphore
semaphore = Semaphore.get_lock(user)
end
...
semaphore.release
end
There are a couple of problems to solve around this: How long do you want to wait to get the semaphore? What happens if the external http requests take ages? Do you need to store additional pieces of information (hostname, pid) to identifier what thread lock an item? You will need some kind of cleanup task the removes locks that still exist after a certain period of time or after restarting the server.
Furthermore I think it is a terrible idea to have something like this in a web server. At least you should move all that stuff into background jobs. What might solve your problem, if your app is small and needs just one background job to get everything done.

You state that this is an ActiveRecord model, in which case the usual approach would be to use a database lock on that record. No need for additional locking mechanisms as far as I can see.
Take a look at the short (one page) Rails Guides section on pessimistic locking - http://guides.rubyonrails.org/active_record_querying.html#pessimistic-locking
Basically you can get a lock on a single record or a whole table (if you were updating a lot of things)
In your case something like this should do the trick...
class User < ActiveRecord::Base
...
def nasty_long_running_method
with_lock do
// something nasty will happen if this method is called simultaneously
// for two instances of the same record and the later one finishes http_request_1
// before the first one finishes http_request_2.
http_request_1 // Takes 1-3 seconds.
http_request_2 // Takes 1-3 seconds.
update_model
end
end
end

I recently created a gem called szymanskis_mutex. It is a module that you can include in the class User and provides the method mutual_exclusion(concern) to provide the functionality you want.
It doesnt rely on databases and doesn't depend on how many processes want to enter the critical section at any given moment.
Note that if the class is initialized in different servers it will not work.
I may suite your needs if your app is small enough. Your code would look like this:
class User
include SzymanskisMutex
...
def nasty_long_running_method
mutual_exclusion(:nasty_long) do
http_request_1 // Takes 1-3 seconds.
http_request_2 // Takes 1-3 seconds.
end
update_model
end
end

I suggest rethinking your architecture as this is not going to be scalable - imagine having multiple ruby processes, failing processes, timeouts etc. Also in-process locking and spawning threads is quite dangerous for application servers.
If you want to sleep well with production then try some async background processing framework for long running tasks with serial queue which will ensure order of running tasks. Just simple RabbitMQ or check this QA Best practice for Rails App to run a long task in the background? , eventually try DB but Optimistic Locking.

Related

How to memoize MySQL connection client cleverly in external module used from e.g. sinatra?

I think the question does not pin-point to the real problem, I have difficulties to nail it down precisely and concisely.
I have a gem that implements i.e. MySQL-database "queries" (also inserts, updates...)
module DBGEM::Query
def self.client settings=DBGEM.settings
##client ||= Mysql2::Client.new settings
end
def query_this
client.query(...)
end
def process_insert_that list_of_things
list_of_things.each do |thing|
# process
client.query(...)
end
end
Furthermore, this gem is used by a sinatra app sitting on a forking webserver like puma.
Within the sinatra-app i can now
get '/path' do
happy = DBGEM::Query.query_this
# process happy
great = DBGEM::Query.process_insert_that 1..20
# go on
end
I like that API and this code should open only one database connection.
But as far as I understood, because the code within the 'get' definition is not guaranteed to be the only one accessing the DBGEM::Query stuff at that time, weird things could happen (through race-conditions, shared internal state?).
Is there a clever way to keep the nice syntax and the connection sharing without boilerplate object creation (query = DBGEM::Query.new() #...) wrapping the stuff in a block (DBGEM::Query.process do |query| #...)?
The example above is obviously simplified. The sinatra handling might be more involved, the Queries actually done in a Service object etc.pp. Also, afaiu in a forking webserver environment, the GC would destroy the client (closing the connection - thats how mysql2 is implemented).
I think that the connection will not be closed every time.
##client is shared between DBGEM::Query object itself (in Ruby modules and classes are also objects) and all the instances of that object (to be precise: all the instances of classes to which that object is mixed in).
So, this variable will live as long as the DBGEM::Query object will live.
You can check out when DBGEM::Query object will be garbage collected, by defining finalizer logging a text and observe the server console.
module DBGEM::Query
ObjectSpace.define_finalizer(self, proc { print 'garbage collected' })
..
end
Im not sure, however I guess that DBGEM::Query object will be garbage collected only when you stop the server.
As it goes for weird "things could happen", I believe you mean potential conflicts, race conditions, situations where you create double records, or update the same record nearly at the same time overwriting something, etc. And when that happen you lose data integrity.
IMHO you can't prevent it by allowing only one client instance. I'd suggest aiming for solid database design (unique constrains, indexes, foreign keys, validations) which can raise errors when race condition occure and then handling that errors in your application.

Ruby thread dies after redirect

i start a thread in my controller and want to redirect the user immediately after that.
class Profile::GeneralController < ProfileController
def update
startTheThread(profile)
#sleep(5)
redirect_to 'selection_controller'
end
end
class ProfileController < ApplicationController
def startTheThread(profile = nil)
$collector_threads[current_user.id][sector] = Thread.new {
Thread.current['collecting_status'] = { a: 1, c: 0, c_id: -1, r: false }
start_threaded_collector(sector)
}
end
end
When i tell the controller to sleep for - let's say - 5 seconds, the thread finishes like it was supposed to.
The thread is dead as soon as the user changes to another page - why is that and how can i keep threads alive across controllers.
I'm with #meagar here, this is asking for serious trouble. Presume your process will be killed after you finish making the web request, as that's something that happens under some Ruby on Rails process managers when they're pruning off excess instances.
Sharing data between controller instances should also be considered impossible unless you're persisting the data somehow: Database, session, cookies, or arguments via GET or POST.
In a typical system you'll have N Ruby processes on M machines, and the processes will be started and stopped arbitrarily, without warning, if they're not actively processing any requests. There's no way to reliably share data between these without some external IPC.
You probably want a background server process that these controllers can contact for any information they might need, or a process that can dump data into a database or a service like Redis where it can be picked up.
The way you could architect this is by pushing a job into a Redis queue, have another process that's watching that queue for work and pops the job and processes it. This is easily done with the BLPOP command in Redis, where your thread will block waiting for work, then immediately continue when there's something to do.

Keeping a track of sessions in Ruby and Capybara

I am using the parallel_tests gem to be able to run multiple features at the same time, the problem I face with my scenario is that I have user based sessions (SSO) so only one user can be logged in at a time.
To combat this I was thinking of being able to randomly select users if they are available, but tracking their login status globally presents an issue for me.
My setup
Before and after each scenario a user will login:
Before('#login_automated_user') do
#user = FactoryGirl.attributes_for(:automated_user)
login_steps("#{APP}")
end
After('#login_automated_user') do
logout_steps
end
I was thinking of having a pool of users in an array and randomly selecting one to use and once finished return the user to the pool:
module UserSession
def choose_user
user_array = factory_girl_users.values
if user_array.length > 0
#user = user_array.pop
end
end
def return_user
user_array << #user
end
def factory_girl_users
Hash[:user_1 => FactoryGirl.attributes_for(:automated_user), :user_2 => FactoryGirl.attributes_for(:automated_user_1)]
end
end
World(UserSession)
This would then make my Before and After hooks look like:
Before('#login_automated_user') do
#user = choose_user
login_steps("#{APP}")
end
After('#login_automated_user') do
logout_steps
return_user
end
One issue I can see here is I'm using #user across two sessions (or more if I had more users) so do I need to separate them out?
Would anyone offer some tips/solutions to contemplate please?
Not sure I understand your exact problem, but assuming it's having single browser for few tests that run in parallel, and thus having one user's login, affecting other test that runs in the same time.
One solution, is having named session per test. Not sure it's reliable.
Other is having each of your process that parallel runs, open its own browser and then you won't have any sessions problem.
It's requiring more memory, but doesn't seem to effect tests speed.
Also instead of logging out, after each test, you could just use reset_sessions! in global teardown.

Proper way to maintain many connections with Celluloid?

I am currently working on an application that pulls mail from many IMAP mailboxes. It seems like Celluloid is a goot fit for this part, but I'm unsure on how to employ actors.
The application will be run in a distributed fashion. There are x mailboxes to poll and y processes among which these will be divided. So each process has a list of mailboxes they have to poll and this list will change every now and then. This means the pool of connections maintained by each process is dynamic.
My biggest question is: should I spawn a separate ImapConnection actor for each mailbox, or should I make a single ImapListener actor that manages all connections internally?
My current design features the former solution. There's one central Coordinator actor that keeps an array of actors that each manage one imap connection. A new connection is added with a simple:
#connections << ImapConnection.supervise(account_info)
The ImapConnection either polls the IMAP server at regular intervals, or maintains an IDLE connection. If the Coordinator wants to stop polling a mailbox it looks it up in its #connections array and properly disposes of it.
This seems like a logical approach for me that yields many benefits of Celluloid (such as automatic restarting of crashed actors), but I'm struggling to find examples of other software that uses this approach. Is spawning 100's of actors in this fashion proper use of the actor model or should I use a different approach?
Very glad to hear you are using Celluloid. Good question.
Not sure how you create connections and maintain them, whether that be by a TCPSocket you have the ability to manage or not. If you have the ability to manage a TCPSocket directly, you ought to use Celluloid::IO as well as Celluloid itself. I also don't know where you put information pulled in from IMAP connections. These two things influence your strategy.
Your approach is not bad, but yes - it could possibly be improved by adding something to do your heavy lifting, polling workers; another to hold account_info only; and a final actor to trigger the work and/or maintain the IDLE state. So you'd end up with ImapWorker ( a pool ), ImapMaintainer, and ImapRegistry. Right here, I wonder if since you are polling, if you need to keep an open connection rather than allowing information to be pushed. If you plan to poll and still keep connections open, here is what the three actors would do:
ImapRegistry holds your account_info in a Hash. This would have methods on it like add, get, and remove. I recommend a Hash of #credentials so you can use the same ID between ImapMaintainer and ImapRegistry; one holds live connections in its #connections, and one holds account_info instances in its #credentials. Both #connections and #credentials are accessed by the same ID, but one keeps a volatile connection whereas the other only has static data useable to recreate a connection if necessary. In this way, your heavy lifters could die, be respawned, and the entire system could regenerate itself.
ImapMaintainer would have the actual #connections in it, and every( interval ) { } tasks built into it, added to when account_info is stored in ImapRegistry. There are two tasks I see, depending on what frequency you plan to poll. One could be to simply touch the IMAP connection to maintain it, and the other could be to poll the IMAP server with ImapWorker. ImapWorker would be a pool saved in ImapMaintainer as say #worker. So it has #connections, #worker, #polling, and #keepalive. polling could be an #connections.each situation, or you could have a timer per connection, added at the point a connection is created.
ImapWorker has two methods... one is #touch that keeps a connection alive. The main one is #poll, which takes a connection you maintain, and runs a polling process on it. That method returns the information or even better stores it also, then the worker returns to the #worker pool. This would give you the benefit of having the polling process happen in a separate thread rather than just a separate fiber, and also allows the most tricky aspect to be kept out in the most robust yet most unaware kind of actor.
Working backward, if ImapRegistry receives #add, it stores account_info and gives that to ImapMaintainer which creates the connection, and timers ( but it forgets account_info and only creates the connection and timer(s) or just creates the connection and lets one big timer maintain the connection with #worker which is a pool. ImapMaintainer inevitably hits a timer, so at the start and end of its timer it can check its connection. If the connection is gone for some reason, it can recreate it with #registry.get information. Within its timer prompted task, it can run #worker.poll or #worker.alive.
This illustrates the above requirements, showing how the initializers would put together the actor system, and has an incomplete skeleton of methods mentioned.
WORKERS = 9 #de arbitrarily chosen
class ImapRegistry
include Celluloid
def initialize
#maintainer = ImapMaintainer.supervise
#credentials = {}
end
def add( account_info )
...
end
def get( id )
...
end
def remove( id )
...
end
end
class ImapMaintainer
include Celluloid
def initialize
#worker = ImapWorker.pool size: WORKERS
#connections = {}
end
def add( id, credential )
...
end
def remove( id )
...
end
#de These exist if there is one big timer:
def polling
...
end
def keepalive
...
end
end
class ImapWorker
include Celluloid
def initialize
#de Nothing needed.
end
def poll( connection )
...
end
def touch( connection )
...
end
end
registry = ImapRegistry.supervise
I love Celluloid and hope you have a lot of success with it. Please ask if you want anything clarified, but this at least is another strategy for you to consider.

Background thread in Rails can't see instance variables

I need to gather up some data from a rails application, aggregate it, and send it off to a remote server periodically. I instantiate my aggregation class in a global variable (I know, I know) in application.rb.
Inside my aggregation class, I fire up a thread that sleeps for 10 seconds, then looks at the queue, processes the data, and sends it. The queue is a hash stored in an instance variable of the class.
From the rails controller, I call a method in the aggregator class to queue the data in the hash. Of course this is on a different thread than the background task that reads the queue. The problem is that the background task never sees any data in the hash. In my log, I print out the object_id of the hash both when I write to it (from the controllers thread), and when I read from it (from the background thread). The hash#object_id matches from both threads, but the background thread never sees the data.
Whats killing me is that this works fine outside of rails. I've set up tests with many threads that really pound on it, and it works fine (there is some thread protection that I am not showing for clarity). Anyone know how the object_id's can match, but the contents are not consistent?
class Aggregator
def initialize
#q = {}
#timer = nil
end
def start
#timer = Thread.new do
loop do
sleep(10)
flush_q
end
end
end
def flush_q
logger.debug "flush: q.object_id = #{#q.object_id}" # matches what I get below
logger.debug "flush: q.length = #{#q.length}" # always zero!
#q.each_pair do |k,v|
# pack it up and send it
end
#q.clear
end
def add(item)
logger.debug "add: q.object_id = #{#q.object_id}" # matches what I get above
#q[item.name] ||= item
logger.debug "add: q.length = #{#q.length}" # increases with each add
# not actually that simple, but not relevant
end
end
I'm going to go out on a limb and assume that your code is deployed using a forking app server (eg unicorn or passenger).
This means that your app is loaded once and then new instances are forked from that master instances. Forking is cheap so this means that new instances of the app can be started up/shutdown really quickly.
I believe that your aggregator instance is getting created/started in this master process. When this forks the process's entire memory space is copied (so there an instance of aggregator in the new process, with the same object id and so on).
However when forking only the current thread is copied , so the aggregator flushing is only happening in the master process, but all the appending is happening in the child processes. You could confirm this by adding Proccess.pid to what you log - you should see that your logging is coming from 2 different process.
One way of fixing this would be to start/restart your thread after the child process has forked. How you do this depends on how the app is being served. With unicorn you can do this in your unicorn config via the after_fork method. With passenger you do
PhusionPassenger.on_event(:starting_worker_process) do |forked|
if forked
...
end
end

Resources