I'm trying to write multi-threaded code to achieve parallelism for a task that is taking too much time. Here is how it looks:
class A
  attr_reader :mutex, :logger

  def initialize
    @reciever = ZeroMQ::Queue
    @sender   = ZeroMQ::Queue
    @mutex    = Mutex.new
    @logger   = Logger.new('log/test.log')
  end

  def run
    50.times do
      Thread.new do
        run_parallel(@reciever.get_data)
      end
    end
  end

  def run_parallel(data)
    ## Define some local variables.
    a, b = data
    ## Log some data to file.
    logger.info "Got #{a}"
    output = B.get_data(b)
    ## Send output back to zeromq.
    mutex.synchronize { @sender.send_data(output) }
  end
end
One needs to make sure that code is thread safe. Sharing and changing data (like @, @@, or $ variables without a proper mutex) across threads can lead to thread safety issues.
I'm not sure whether passing the data to a method results in a thread safety issue as well. In other words, do I have to wrap the body of run_parallel in a mutex if I'm not using any @, @@, or $ variables inside the method? Or is the given mutex usage enough?
mutex.synchronize { @sender.send_data(output) }
Whenever you're running in a threaded context, you've got to be aware of (as a simple heuristic) anything that's not a local variable. I see these potential problems in your code:
run_parallel(@reciever.get_data) Is get_data threadsafe? You've synchronized send_data, and they're both a ZeroMQ::Queue, so I'm guessing not.
output = B.get_data(b) Is this call threadsafe? If it just pulls something out of b, you're fine, but if it uses state in B or calls anything else that has state, you're in trouble.
logger.info "Got #{a}" #coreyward points out that Logger is threadsafe, so this is no trouble. Just make sure to stick with it over puts, which will garble your output.
Once you're inside the mutex for @sender.send_data, you're safe, assuming @sender isn't accessed anywhere else in your code by another thread. Of course, the more synchronize you throw around, the more your threads will block on each other and lose performance, so there's a balance you need to find in your design.
Do what you can to make your code functional: try to use only local state and write methods that don't have side effects. As your task gets more complicated, there are libraries like concurrent-ruby with threadsafe data structures and other patterns that can help.
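For instance, here's a minimal sketch of your producer loop using a threadsafe collection from concurrent-ruby; do_work is a hypothetical stand-in for whatever your real workload is:

require 'concurrent'

results = Concurrent::Array.new # threadsafe array; no manual mutex needed

threads = 50.times.map do |i|
  Thread.new { results << do_work(i) } # do_work: hypothetical workload
end
threads.each(&:join)

puts results.size # => 50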
(Posted already at https://www.ruby-forum.com/topic/6876320, but crossposted here because I have not received a response so far.)
A question about parallelizing tests in Minitest and/or Test::Unit (i.e. proper use of parallelize_me!):
Assume that I have some helper methods, which are needed by several tests. From my understanding, I could NOT do something like this in such a method (simplified example):
def prep(m, n)
  @pid = m
  @state = n
end

def process
  if @state > 5 && @pid != 0
    ...
  else
    ...
  end
end
I think I can't do this in Minitest and test-unit, because if I call prep and process from several of my test functions, the tests cannot be parallelized anymore: those test functions all set and read the same instance variables. Right?
Now, my question is whether the following approach would be safe for parallelization: I make each of these mutable instance variables a hash, which I initialize in setup like this:
def setup
  @pid ||= {}
  @state ||= {}
end
My "helper methods" receive a key (for example, the name of the test
method) and use it to access the their "own" hash element:
def prep(key, m, n)
  @pid[key] = m
  @state[key] = n
end

def process(key)
  if @state[key] > 5 && @pid[key] != 0
    ...
  else
    ...
  end
end
It's a bit ugly, but: Is this a reliable approach? Is this way of accessing a hash thread-safe? How can I do it better?
At least in Minitest you can safely do, for example,
setup do
  @form = Form.new
end
without @form getting mixed up between parallel tests, so this approach should be safe too:
def setup
  @stat = m
  @pid = n
end
which means that your original approach should be safe as well.
================
UPDATE
Consider the following gist, with a piece of code that defines 100 different tests accessing @random, which is set in setup: https://gist.github.com/bbozo/2a64e1f53d29747ca559
You will notice that the state set in setup isn't shared among tests; setup is run before every test, so each test is effectively encapsulated and thread safety isn't an issue.
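Here's a condensed sketch of what the gist demonstrates (the test names and assertions are just illustrative):

require 'minitest/autorun'

class RandomTest < Minitest::Test
  parallelize_me!

  def setup
    @random = rand(1000) # set freshly before each test
  end

  def test_one
    refute_nil @random # each test sees only its own @random
  end

  def test_two
    refute_nil @random # isolated from test_one's instance
  end
end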
Your approach with the hashes makes sense, and it will work to distinguish between the threads. The problem lies with the Global Interpreter Lock.
Unless your helper methods are IO-bound (make HTTP requests, socket requests, handle local files), you won't see a speed improvement because Ruby will pretty much (to simplify things) run your code sequentially over multiple threads, without a guaranteed run order.
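If you want to see this for yourself, here's a rough benchmark sketch (sleep stands in for IO-bound work):

require 'benchmark'

# IO-bound work overlaps across threads: ~0.5s total, not ~5s.
io_time = Benchmark.realtime do
  10.times.map { Thread.new { sleep 0.5 } }.each(&:join)
end
puts "IO-bound:  #{io_time.round(2)}s"

# CPU-bound work is serialized by the GIL: roughly as long as
# running the four loops one after another.
cpu_time = Benchmark.realtime do
  4.times.map { Thread.new { 2_000_000.times { |i| i * i } } }.each(&:join)
end
puts "CPU-bound: #{cpu_time.round(2)}s"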
Good luck!
I am new to Ruby. I am confused by something I am reading here:
http://alma-connect.github.io/techblog/2014/03/rails-pub-sub.html
They offer this code:
# app/pub_sub/publisher.rb
module Publisher
  extend self

  # delegate to ActiveSupport::Notifications.instrument
  def broadcast_event(event_name, payload = {})
    if block_given?
      ActiveSupport::Notifications.instrument(event_name, payload) do
        yield
      end
    else
      ActiveSupport::Notifications.instrument(event_name, payload)
    end
  end
end
What is the difference between doing this:
ActiveSupport::Notifications.instrument(event_name, payload) do
  yield
end
versus doing this:
ActiveSupport::Notifications.instrument(event_name, payload)
yield
If this were another language, I might assume that we first call the method instrument(), and then we call yield so as to call the block. But that is not what they wrote. They show yield being nested inside of ActiveSupport::Notifications.instrument().
Should I assume that ActiveSupport::Notifications.instrument() is returning some kind of iterable, that we will iterate over? Are we calling yield once for every item returned from ActiveSupport::Notifications.instrument()?
While blocks are frequently used for iteration, they have many other uses. One is to ensure proper resource cleanup; for example,
ActiveRecord::Base.with_connection do
  ...
end
This checks out a database connection for the thread, yields to the block, and then checks the connection back in.
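You can write the same shape yourself; here's a minimal sketch with a hypothetical pool object:

def with_connection
  conn = pool.checkout       # acquire the resource (pool is hypothetical)
  yield conn                 # hand it to the caller's block
ensure
  pool.checkin(conn) if conn # always release, even if the block raised
end

with_connection do |conn|
  conn.execute("...")
end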
In the specific case of the instrument method you found, what it does is add, to the event data it is about to broadcast, information about how long its block took to execute. The actual implementation is more complicated, but in broad terms it's not so different from:
event = Event.new(event_name, payload)
event.start = Time.now
yield
event.end = Time.now
event
The use of yield allows it to wrap the execution of your code with some timing code. In your second example, no block is passed to instrument; it detects this and records the event as having no duration.
The broadcast_event method has been designed to accept an optional block.
ActiveSupport::Notifications.instrument also takes an optional block.
Your first example simply takes the block passed in to broadcast_event and forwards it along to ActiveSupport::Notifications.instrument. If there's no block, you can't yield anything, hence the different calls.
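For example (the event name and the mailer call here are hypothetical):

# Timed event: instrument yields to the block and records its duration.
Publisher.broadcast_event("user.signed_up", user_id: 42) do
  UserMailer.welcome(42).deliver # hypothetical work being timed
end

# Instantaneous event: no block is given, so nothing is yielded.
Publisher.broadcast_event("user.signed_up", user_id: 42)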
Ruby has had native support for thread-local variables since version 2.0. However, active_support/core_ext/thread.rb implements this feature in pure Ruby to support thread locals on earlier versions of Ruby. So I wonder: why should we use a mutex in the _locals method?
https://github.com/rails/rails/blob/ec1227a9cc682ebf796689ef0f329038162c421b/activesupport/lib/active_support/core_ext/thread.rb#L76
_locals does two things:
def _locals
  # 1. Returns the local variable hash when defined
  if defined?(@_locals)
    @_locals
  # 2. Lazily instantiates a locals hash
  else
    LOCK.synchronize { @_locals ||= {} }
  end
end
That synchronization step is required to ensure that @_locals is never clobbered during its first access.
Consider the following scenario:
thread = Thread.new { sleep }

# Say I run this statement...
thread.thread_variable_set('a', 1)

# In parallel with this statement...
thread.thread_variable_get(:a)
Both of those methods call _locals, and if they execute simultaneously, they may both end up at the lazy assignment step:
@_locals ||= {}

# Expands to...
unless @_locals
  @_locals = {} # <-- We could end up here with both threads at the same time,
end             #     which jeopardizes any value that might have been set.
So imagine we had no mutex, and the setter completed execution while the getter had just entered the lazy assignment step. We'd effectively lose any locals we set, due to a thread collision. Calling synchronize on a mutex guarantees that the block executes to completion without any such collisions.
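Here's a contrived sketch of the same check-then-set race on a shared variable (MRI's scheduling makes it hard to actually observe, but the interleaving is legal):

locals = nil

t1 = Thread.new do
  locals ||= {}  # thread 1 sees nil and prepares a fresh hash
  locals[:a] = 1
end

t2 = Thread.new do
  locals ||= {}  # thread 2 may also see nil before t1 assigns,
  locals[:b] = 2 # replacing t1's hash and silently dropping :a
end

[t1, t2].each(&:join)
p locals # may contain both keys, or only one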
Note that the core extension is not loaded at all under versions of Ruby that support thread-local variables natively. See the very last line:
unless Thread.instance_methods.include?(:thread_variable_set)
Can anybody explain to me why the Redis (redis-rb) synchrony driver works directly under an EM.synchrony block but doesn't within EM::Connection?
Consider the following example:
EM.synchrony do
  redis = Redis.new(:path => "/usr/local/var/redis.sock")
  id = redis.incr "local:id_counter"
  puts id

  EM.start_server('0.0.0.0', 9999) do |c|
    def c.receive_data(data)
      redis = Redis.new(:path => "/usr/local/var/redis.sock")
      puts redis.incr "local:id_counter"
    end
  end
end
I'm getting
can't yield from root fiber (FiberError)
when using it within receive_data. From reading the source code of both EventMachine and em-synchrony, I can't figure out what the difference is.
Thanks!
PS: The obvious workaround is to wrap the redis code within EventMachine::Synchrony.next_tick as hinted at in issue #59, but given EM.synchrony I would expect the call to already be wrapped within a Fiber...
PPS: The same applies to using EM::Synchrony::Iterator.
You're doing something rather tricky here. You're providing a block to start_server, which effectively creates an "anonymous" connection class and executes your block within the post_init method of that class. Then, within that class, you're defining an instance method.
The thing to keep in mind is: when the reactor executes a callback, or a method like receive_data, that happens on the main thread (and within the root fiber), which is why you're seeing this exception. To work around this, you need to wrap each callback so it is executed within a Fiber (for example, see the Synchrony.add_(periodic)_timer methods).
To address your actual exception: wrap the execution of receive_data within a Fiber. The outer EM.synchrony {} won't do anything for callbacks which are scheduled later by the reactor.
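Applied to your example, a minimal sketch (same Redis setup as in your question):

EM.synchrony do
  EM.start_server('0.0.0.0', 9999) do |c|
    def c.receive_data(data)
      # Spawn a Fiber per callback so the synchrony driver can yield from it:
      Fiber.new do
        redis = Redis.new(:path => "/usr/local/var/redis.sock")
        puts redis.incr "local:id_counter"
      end.resume
    end
  end
end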
I'm trying to make an API for dynamically reloading processes; right now I'm at the point where I want to provide, in all contexts, a method called reload!. However, I'm implementing this method on an object that has some state, so it can't live on Kernel.
Suppose we have something like
WorkerForker.run_in_worker do
  # some code over here...
  reload! if some_condition
end
Inside the run_in_worker method there is code like the following:
begin
  worker = Worker.new(pid, stream)
  block.call
rescue NoMethodError => e
  if e.message =~ /reload!/
    puts "reload! was called"
    worker.reload!
  else
    raise e
  end
end
So I'm doing it this way because I want to make the reload! method available in any nested context, and I don't want to clutter the block I'm receiving with an instance_eval on the worker instance.
So my question is: are there any complications with this approach? I don't know if anybody has done this already (I haven't read that much code yet); has it been done before? Is there a better way to achieve what this code is trying to do?
Assuming I understand you now, how about this:
my_object = Blah.new

Object.send(:define_method, :reload!) {
  my_object.reload!
  ...
}
With this approach, every object that invokes the reload! method modifies the same shared state, since my_object is captured by the block passed to define_method.
What's wrong with doing this?
def run_in_worker(&block)
  ...
  worker = Worker.new(pid, stream)
  block.call(worker)
end
WorkerForker.run_in_worker do |worker|
  worker.reload! if some_condition
end
It sounds like you just want every method to know about an object without the method or the method's owner having been told about it. The way to accomplish this is a global variable. It's not generally considered a good idea (it leads to concurrency and ownership issues, and it makes unit testing harder), but if that's what you want, there it is.
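If you do go that route, here's a minimal sketch reusing the names from your snippet (Worker, pid, stream, and some_condition are assumed to exist as in your code):

$worker = nil # global state, visible from any nested block

def reload!
  $worker.reload! if $worker
end

class WorkerForker
  def self.run_in_worker(&block)
    $worker = Worker.new(pid, stream) # pid and stream as in your snippet
    block.call
  end
end

WorkerForker.run_in_worker do
  reload! if some_condition # resolves to the top-level reload! above
end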