Is assignment an atomic operation in ruby MRI? - ruby

Let's say I have these two methods in my class.
def set_val(val)
#val = val
end
def get_val
#val
end
I'll spawn multiple threads to call set_val with different values. Is it guaranteed that reading from #val returns the correct value, i.e., not the last assigned value, but a value that was passed to set_val? Could I get something strange when reading? Is the assignment operation atomic? Is it indivisible irrespective of the number of threads?

This depends a bit on the Ruby implementation you are using. As for MRI Ruby (the "default" Ruby), this is a safe (atomic) operation due to its Global Interpreter Lock which guards some operations such as assignments from bein interrupted by context switches.
JRuby also guarantees that some operations are thread-safe, including assignment to instance variables.
In any case, please make sure to take into account that any such concurrent access can be serialized in a seemingly random way. That is, you can't guarantee which threads assigns first and which one last unless you use explicit locks such as a Mutex.

Related

Why a+=1 is a thread-safe operation in ruby?

Code Snippet:
a = 0
Array.new(50){
Thread.new {
500_000.times { a += 1 }
}
}.each(&:join)
p "a: #{a}"
Result: a = 25_000_000.
In my understanding, (MRI) Ruby use GIL, so there is only one ruby thread can get the CPU, but when thread-switch happend, some data of ruby thread will be stored for restoring thread later. So, in theory, a += 1 may not be thread-safe.
But the result above turns out I'm wrong. Does Ruby makes a+=1 atomic? If true, which operations can be considered thread-safe?
It's Neither Atomic Nor Thread-Safe
In your example, the apparent consistency is largely due to the global interpreter lock, but is also partly due to the way your Ruby engine and your code sequences (theoretically) asynchronous threads. You are getting consistent results because each loop in each thread is simply incrementing the current value of a, which is not a block-local or thread-local variable. With threads on the YARV virtual machine, only one thread at a time is inspecting or setting the current value of a, but I wouldn't really say that it's an atomic operation. It's just a byproduct of the engine’s lack of real-time concurrency between threads, and the underlying implementation of the Ruby virtual machine.
If you're concerned about preserving thread-safety in Ruby without relying on idiosyncratic behaviors that just happen to appear consistent, consider using a thread-safe library like concurrent-ruby. Otherwise, you may be relying on behaviors that aren't guaranteed across Ruby engines or Ruby versions.
For example, three consecutive runs of your code in JRuby (which does have concurrent threads) will generally yield different results on each run. For example:
#=> "a: 3353241"
#=> "a: 3088145"
#=> "a: 2642263"
Ruby doesn't have a well-defined Memory Model, so in some philosophical sense, the question is non-sensical, since without a Memory Model, the term "thread-safe" isn't even defined. For example, the ISO Ruby Language Specification doesn't even document the Thread class.
The way that people write concurrent code in Ruby without a well-defined Memory Model is essentially "guess-and-test". You guess what the implementations will do, then you test as many versions of as many implementations on as many platforms and as many operating systems on as many CPU architectures and as many different system sizes as possible.
As you can see in Todd's answer, even just testing one other implementation already reveals that your conclusion was wrong. (Pro tip: never make a generalization based on a sample size of 1!)
The alternative is to use a library that has already done the above, such as the concurrent-ruby library mentioned in Todd's answer. They do all the testing I mentioned above. They also work closely with the maintainers of the various implementations. E.g. Chris Seaton, the lead developer of TruffleRuby is also one of the maintainers of concurrent-ruby, and Charlie Nutter, the lead developer of JRuby, is one of the contributors.
But the result above turns out I'm wrong.
The results are misleading. In Ruby, a += 1 is a shorthand for:
a = a + 1
With a + 1 being a method call that occurs before the assignment. Since integers are objects in Ruby, we can override that method:
module ThreadTest
def +(other)
super
end
end
Integer.prepend(ThreadTest)
The above code doesn't do anything useful, it just calls super. But merely adding a Ruby implementation on top of the built-in C implementation is enough to break (or fix) your test:
Integer.prepend(ThreadTest)
a = 0
Array.new(50){
Thread.new {
500_000.times { a += 1 }
}
}.each(&:join)
p "a: #{a}"
#=> "a: 11916339"

Why does my Ruby script slow down over time?

I have a 2.6 gigabyte text file containing a dump of a database table, and I'm trying to pull it into a logical structure so the fields can all be uniqued. The code I'm using to do this is here:
class Targetfile
include Enumerable
attr_accessor :inputfile, :headers, :input_array
def initialize(file)
#input_array = false
#inputfile = File.open(file, 'r')
#x = #inputfile.each.count
end
def get_headers
#y = 1
#inputfile.rewind
#input_array = Array.new
#headers = #inputfile.first.chomp.split(/\t/)
#inputfile.each do |line|
print "\n#{#y} / #{#x}"
#y+=1
self.assign_row(line)
end
end
def assign_row(line)
row_array = line.chomp.encode!('UTF-8', 'UTF-8', :invalid => :replace).split(/\t/)
#input_array << Hash[ #headers.zip(row_array) ]
end
def send_build
#input_array || self.get_headers
end
def each
self.send_build.each {|row| yield row}
end
end
The class is initialized successfully and I am left with a Targetfile class object.
The problem is that when I then call the get_headers method, which converts the file into an array of hashes, it begins slowing down immediately.
This isn't noticeable to my eyes until around item number 80,000, but then it becomes apparent that every 3-4,000 lines of the file, some sort of pause is occurring. That pause, each time it occurs, takes slightly longer, until by the millionth line, it's taking longer than 30 seconds.
For practical purposes, I can just chop up the file to avoid this problem, then combine the resulting lists and unique -that- to get my final outputs.
From a curiosity standpoint, however, I'm unsatisfied.
Can anyone tell me why this pause is occurring, why it gets longer, and if there's any way to avoid it elegantly? Really I just want to know what it is and why it happens, because now that I've noticed it, I see it in a lot of other Ruby scripts I run, both on this computer and on others.
I'd suggest doing this in the DBM, not Ruby or any other language. A DBM can tell you the unique values for a field very quickly, especially if it's already indexed.
Trying to do this in any language is duplicating the basic functionality of the database in something designed for general computing.
Instead, use Ruby with an ORM like Sequel or Active Record, and issue queries to the database and let it return the things you want to know. Don't iterate over every row, that's madness, ask it to give you the unique values and go from there.
I wouldn't blame Ruby, because the same problem would occur in any other language given the same host and RAM. C/C++ might delay the inevitable by generating more compact code, but your development time will slow drastically, especially as you learn an unknown language like C. And the risk of unintended errors goes up because you have to do a lot more housekeeping and defensive programming than you'd do in Ruby, Python, or Perl.
Use each tool for what it's designed for and you'll be ahead.
Looking at your code, you could probably improve the chances of making it through a complete run by NOT trying to keep every row in memory. You said you're trying to determine uniqueness, so keep only the unique column values you're interested in, which you can do easily using Ruby's Set class. You can throw the values of each thing you want to determine uniqueness on, walk the file, and Set will only keep the unique values.
This is the infamous garbage collector -- Ruby's memory managment mechanism.
Note: It's worth mentioning that Ruby, at least MRI, isn't a high performance language.
The garbage collector runs whenever memory starts to run out. The garbage collector pauses the execution of the program to deallocate any objects that can no longer be accessed. The garbage collector only runs when memory starts to run out. That's why you're seeing it periodically.
There's nothing you can do to avoid this, except write more memory efficiant code, or rewrite in a language that can has better/manual memory management.
Also, your OS may be paging. Do you have enough physical memory for this kind of task?
You are using the headers as keys for the hash. They are strings, and hashes duplicate string keys. That is a lot of unnecessary strings. Try if converting them to symbols speeds things up:
#headers = #headers.map{|header| header.to_sym}
This is the Garbage Collector. You can force garbage collection by putting in GC.start in your program. Have it run periodically.
I had to do the same thing for a daemon I wrote. It works well.
http://ruby-doc.org/core-1.9.3/GC.html

How should I handle sparse instance variables from a gem class like feedzirra?

My objective is to use feedzirra, or a viable alternative, to access many varying RSS feeds and store them in ActiveRecords for later processing. Obviously, these feeds will vary in composition.
Just running the sample feedzirra code, I've found that its implementation of instance variables is sparse. If the feed doesn't provide the information, the variable is left at nil. This is okay, but as soon as I run a method like sanitize, I receive a NoMethod error on Nil.
Doing some research, I see there is an instance_variables method that would allow me to grab active variables. I could use that but it leaves the problem as to later downstream code that may be checking for instance variables that simply don't exist.
I'm torn on how to handle the situation of sparse instance variables. As I write my code, I need to be able to rely on the input so that I can run processes and methods without (too much) concern as to its reliability or existence.
At this point, I've simply set a rescue NoMethodError trap which is wholly insufficient. I figure that I have to start with some of my own "sanitize" methods to ensure that the input is both safe and reliable. But, when I get nil input, what do I do? I cannot leave it nil as later methods will fail. I could inject some standard string such as " " or "unavailable", either of which could occur in the wild.
This is the first time that I have gathered input that I'd considered wholly unsafe and unreliable. I need recommendations on what I need to do to clean it up before I process it.
At this point, lacking a viable alternative, I have fallen back to checking for nil on the variables as soon as I receive them and setting them to pre-defined values that I can check later needed.
I'm not sure this is the best solution, but getting nil NoMethodError's isn't either...

how to know what is NOT thread-safe in ruby?

starting from Rails 4, everything would have to run in threaded environment by default. What this means is all of the code we write AND ALL the gems we use are required to be threadsafe
so, I have few questions on this:
what is NOT thread-safe in ruby/rails? Vs What is thread-safe in ruby/rails?
Is there a list of gems that is known to be threadsafe or vice-versa?
is there List of common patterns of code which are NOT threadsafe example #result ||= some_method?
Are the data structures in ruby lang core such as Hash etc threadsafe?
On MRI, where there a GVL/GIL which means only 1 ruby thread can run at a time except for IO, does the threadsafe change effect us?
None of the core data structures are thread safe. The only one I know of that ships with Ruby is the queue implementation in the standard library (require 'thread'; q = Queue.new).
MRI's GIL does not save us from thread safety issues. It only makes sure that two threads cannot run Ruby code at the same time, i.e. on two different CPUs at the exact same time. Threads can still be paused and resumed at any point in your code. If you write code like #n = 0; 3.times { Thread.start { 100.times { #n += 1 } } } e.g. mutating a shared variable from multiple threads, the value of the shared variable afterwards is not deterministic. The GIL is more or less a simulation of a single core system, it does not change the fundamental issues of writing correct concurrent programs.
Even if MRI had been single-threaded like Node.js you would still have to think about concurrency. The example with the incremented variable would work fine, but you can still get race conditions where things happen in non-deterministic order and one callback clobbers the result of another. Single threaded asynchronous systems are easier to reason about, but they are not free from concurrency issues. Just think of an application with multiple users: if two users hit edit on a Stack Overflow post at more or less the same time, spend some time editing the post and then hit save, whose changes will be seen by a third user later when they read that same post?
In Ruby, as in most other concurrent runtimes, anything that is more than one operation is not thread safe. #n += 1 is not thread safe, because it is multiple operations. #n = 1 is thread safe because it is one operation (it's lots of operations under the hood, and I would probably get into trouble if I tried to describe why it's "thread safe" in detail, but in the end you will not get inconsistent results from assignments). #n ||= 1, is not and no other shorthand operation + assignment is either. One mistake I've made many times is writing return unless #started; #started = true, which is not thread safe at all.
I don't know of any authoritative list of thread safe and non-thread safe statements for Ruby, but there is a simple rule of thumb: if an expression only does one (side-effect free) operation it is probably thread safe. For example: a + b is ok, a = b is also ok, and a.foo(b) is ok, if the method foo is side-effect free (since just about anything in Ruby is a method call, even assignment in many cases, this goes for the other examples too). Side-effects in this context means things that change state. def foo(x); #x = x; end is not side-effect free.
One of the hardest things about writing thread safe code in Ruby is that all core data structures, including array, hash and string, are mutable. It's very easy to accidentally leak a piece of your state, and when that piece is mutable things can get really screwed up. Consider the following code:
class Thing
attr_reader :stuff
def initialize(initial_stuff)
#stuff = initial_stuff
#state_lock = Mutex.new
end
def add(item)
#state_lock.synchronize do
#stuff << item
end
end
end
A instance of this class can be shared between threads and they can safely add things to it, but there's a concurrency bug (it's not the only one): the internal state of the object leaks through the stuff accessor. Besides being problematic from the encapsulation perspective, it also opens up a can of concurrency worms. Maybe someone takes that array and passes it on to somewhere else, and that code in turn thinks it now owns that array and can do whatever it wants with it.
Another classic Ruby example is this:
STANDARD_OPTIONS = {:color => 'red', :count => 10}
def find_stuff
#some_service.load_things('stuff', STANDARD_OPTIONS)
end
find_stuff works fine the first time it's used, but returns something else the second time. Why? The load_things method happens to think it owns the options hash passed to it, and does color = options.delete(:color). Now the STANDARD_OPTIONS constant doesn't have the same value anymore. Constants are only constant in what they reference, they do not guarantee the constancy of the data structures they refer to. Just think what would happen if this code was run concurrently.
If you avoid shared mutable state (e.g. instance variables in objects accessed by multiple threads, data structures like hashes and arrays accessed by multiple threads) thread safety isn't so hard. Try to minimize the parts of your application that are accessed concurrently, and focus your efforts there. IIRC, in a Rails application, a new controller object is created for every request, so it is only going to get used by a single thread, and the same goes for any model objects you create from that controller. However, Rails also encourages the use of global variables (User.find(...) uses the global variable User, you may think of it as only a class, and it is a class, but it is also a namespace for global variables), some of these are safe because they are read only, but sometimes you save things in these global variables because it is convenient. Be very careful when you use anything that is globally accessible.
It's been possible to run Rails in threaded environments for quite a while now, so without being a Rails expert I would still go so far as to say that you don't have to worry about thread safety when it comes to Rails itself. You can still create Rails applications that aren't thread safe by doing some of the things I mention above. When it comes other gems assume that they are not thread safe unless they say that they are, and if they say that they are assume that they are not, and look through their code (but just because you see that they go things like #n ||= 1 does not mean that they are not thread safe, that's a perfectly legitimate thing to do in the right context -- you should instead look for things like mutable state in global variables, how it handles mutable objects passed to its methods, and especially how it handles options hashes).
Finally, being thread unsafe is a transitive property. Anything that uses something that is not thread safe is itself not thread safe.
In addition to Theo's answer, I'd add a couple problem areas to lookout for in Rails specifically, if you're switching to config.threadsafe!
Class variables:
##i_exist_across_threads
ENV:
ENV['DONT_CHANGE_ME']
Threads:
Thread.start
starting from Rails 4, everything would have to run in threaded environment by default
This is not 100% correct. Thread-safe Rails is just on by default. If you deploy on a multi-process app server like Passenger (community) or Unicorn there will be no difference at all. This change only concerns you, if you deploy on a multi-threaded environment like Puma or Passenger Enterprise > 4.0
In the past if you wanted to deploy on a multi-threaded app server you had to turn on config.threadsafe, which is default now, because all it did had either no effects or also applied to a Rails app running in a single process (Prooflink).
But if you do want all the Rails 4 streaming benefits and other real time stuff of the multi-threaded deployment
then maybe you will find this article interesting. As #Theo sad, for a Rails app, you actually just have to omit mutating static state during a request. While this a simple practice to follow, unfortunately you cannot be sure about this for every gem you find. As far as i remember Charles Oliver Nutter from the JRuby project had some tips about it in this podcast.
And if you want to write a pure concurrent Ruby programming, where you would need some data structures which are accessed by more than one thread you maybe will find the thread_safe gem useful.

Ruby Semaphores?

I'm working on an implementation of the "Fair Barbershop" problem in Ruby. This is for a class assignment, but I'm not looking for any handouts. I've been searching like crazy, but I cannot seem to find a Ruby implementation of Semaphores that mirror those found in C.
I know there is Mutex, and that's great. Single implementation, does exactly what that kind of semaphore should do.
Then there's Condition Variables. I thought that this was going to work out great, but looking at these, they require a Mutex for every wait call, which looks to me like I can't put numerical values to the semaphore (as in, I have seven barbershops, 3 barbers, etc.).
I think I need a Counting Semaphore, but I think it's a little bizarre that Ruby doesn't (from what I can find) contain such a class in its core. Can anyone help point me in the right direction?
If you are using JRuby, you can import semaphores from Java as shown in this article.
require 'java'
java_import 'java.util.concurrent.Semaphore'
SEM = Semaphore.new(limit_of_simultaneous_threads)
SEM.acquire #To decrement the number available
SEM.release #To increment the number available
There's http://sysvipc.rubyforge.org/SysVIPC.html which gives you SysV semaphores. Ruby is perfect for eliminating the API blemishes of SysV semaphores and SysV semaphores are the best around -- they are interprocess semaphores, you can use SEM_UNDO so that even SIGKILLs won't mess up your global state (POSIX interprocess semaphores don't have this), and you with SysV semaphores you can perform atomic operations on several semaphores at once as long as they're in the same semaphore set.
As for inter-thread semaphores, those should be perfectly emulatable with Condition Variables and Mutexes. (See Bernanrdo Martinez's link for how it can be done).
I also found this code:
https://gist.github.com/pettyjamesm/3746457
probably someone might like this other option.
since concurrent-ruby is stable (beyond 1.0) and is being widely used thus the best (and portable across Ruby impls) solution is to use its Concurrent::Semaphore class
Thanks to #x3ro for his link. That pointed me in the right direction. However, with the implementation that Fukumoto gave (at least for rb1.9.2) Thread.critical isn't available. Furthermore, my attempts to replace the Thread.critical calls with Thread.exclusive{} simply resulted in deadlocks. It turns out that there is a proposed Semaphore patch for Ruby (which I've linked below) that has solved the problem by replacing Thread.exclusive{} with a Mutex::synchronize{}, among a few other tweaks. Thanks to #x3ro for pushing me in the right direction.
http://redmine.ruby-lang.org/attachments/1109/final-semaphore.patch
Since the other links here aren't working for me, I decided to quickly hack something together. I have not tested this, so input and corrections are welcome. It's based simply on the idea that a Mutex is a binary Semaphore, thus a Semaphore is a set of Mutexes.
https://gist.github.com/3439373
I think it might be useful to mention the Thread::Queue in this context for others arriving at this question.
The Queue is a thread-safe tool (implemented with some behind-the-scenes synchronization primitives) that can be used like a traditional multi-processing semaphore with just a hint of imagination. And it comes preloaded by default, at least in ruby v3:
#!/usr/bin/ruby
# hold_your_horses.rb
q = Queue.new
wait_thread = Thread.new{
puts "Wait for it ..."
q.pop
puts "... BOOM!"
}
sleep 1
puts "... click, click ..."
q.push nil
wait_thread.join
And can be demonstrated simply enough:
user#host:~/scripts$ ruby hold_your_horses.rb
Wait for it ...
... click, click ...
... BOOM!
The docs for ruby v3.1 say a Queue can be initialized with an enumerable object to set up initial contents but that wasn't available in my v3.0. But if you want a semaphore with, say, 7 permits, it's easy to stuff the box with something like:
q = Queue.new
7.times{ q.push nil }
I used the Queue to implement baton-passing between some worker-threads:
class WaitForBaton
def initialize
#q = Queue.new
end
def pass_baton
#q.push nil
sleep 0.0
end
def wait_for_baton
#q.pop
end
end
So that thread task_master could perform steps one and three with thread little_helper stepping in at the appropriate time to handle step two:
baton = WaitForBaton.new
task_master = Thread.new{
step_one(ARGV[0])
baton.pass_baton
baton.wait_for_baton
step_three(logfile)
}
little_helper = Thread.new{
baton.wait_for_baton
step_two(ARGV[1])
baton.pass_baton
}
task_master.join
little_helper.join
Note that the sleep 0.0 in the .pass_baton method of my WaitForBaton class is necessary to prevent task_master from passing the baton to itself: unless thread scheduling happens to jump away from task_master right after baton.pass_baton, the very next thing that happens is task_master's baton.wait_for_baton - which takes the baton right back again. sleep 0.0 explicitly cedes execution to any other threads that might be waiting to run (and, in this case, blocking on the underlying Queue).
Ceding execution is not the default behavior because this is a somewhat unusual usage of semaphore technology - imagine that task_master could be generating many tasks for little_helpers to do and task_master can efficiently get right back to generating tasks right after passing a task off through a Thread::Queue's .push([object]) method.

Resources