Ruby thread safe thread creation - ruby

I came across this bit of code today:
#thread ||= Thread.new do
# this thread should only spin up once
end
It is being called by multiple threads. I was worried that if multiple threads are calling this code that you could have multiple threads created since #thread access is not synchronized. However, I was told that this could not happen because of the Global Interpreter Lock. I did a little bit of reading about threads in Ruby and it seems like individual threads that are running Ruby code can get preempted by other threads. If this is the case, couldn't you have an interleaving like this:
Thread A Thread B
======== ========
Read from #thread .
Thread.New .
[Thread A preempted] .
. Read from #thread
. Thread.New
. Write to #thread
Write to #thread
Additionally, since access to #thread is not synchronized are writes to #thread guaranteed to be visible to all other threads? The memory models of other languages I've used in the past do not guarantee visibility of writes to memory unless you synchronize access to that memory using atomics, mutexes, etc.
I'm still learning Ruby and realize I have a long way to go to understanding concurrency in Ruby. Any help on this would be super appreciated!

You need a mutex. Essentially the only thing the GIL protects you from is accessing uninitialized memory. If something in Ruby could be well-defined without being atomic, you should not assume it is atomic.
A simple example to show that your example ordering is possible. I get the "double set" message every time I run it:
$global = nil
$thread = nil
threads = []
threads = Array.new(1000) do
Thread.new do
sleep 1
$thread ||= Thread.new do
if $global
warn "double set!"
else
$global = true
end
end
end
end
threads.each(&:join)

Related

Mutexes not working, using queues works. Why?

In this example I'm looking to sync two puts, in a way that the output will be ababab..., without any double as or bs on the output.
I have three examples for that: Using a queue, using mutexes in memory and using mutex with files. The queue example work just fine, but the mutexes don't.
I'm not looking for a working code. I'm looking to understand why using a queue it works, and using mutexes don't. By my understanding, they are supposed to be equivalent.
Queue example: Work.
def a
Thread.new do
$queue.pop
puts "a"
b
end
end
def b
Thread.new do
sleep(rand)
puts "b"
$queue << true
end
end
$queue = Queue.new
$queue << true
loop{a; sleep(rand)}
Mutex file example: Don't work.
def a
Thread.new do
$mutex.flock(File::LOCK_EX)
puts "a"
b
end
end
def b
Thread.new do
sleep(rand)
puts "b"
$mutex.flock(File::LOCK_UN)
end
end
MUTEX_FILE_PATH = '/tmp/mutex'
File.open(MUTEX_FILE_PATH, "w") unless File.exists?(MUTEX_FILE_PATH)
$mutex = File.new(MUTEX_FILE_PATH,"r+")
loop{a; sleep(rand)}
Mutex variable example: Don't work.
def a
Thread.new do
$mutex.lock
puts "a"
b
end
end
def b
Thread.new do
sleep(rand)
puts "b"
$mutex.unlock
end
end
$mutex = Mutex.new
loop{a; sleep(rand)}
Short answer
Your use of the mutex is incorrect. With Queue, you can populate with one thread and then pop it with another, but you cannot lock a Mutex with one one thread and then unlock with another.
As #matt explained, there are several subtle things happening like the mutex getting unlocked automatically and the silent exceptions you don't see.
How Mutexes Are Commonly Used
Mutexes are used to access a particular shared resource, like a variable or a file. The synchronization of variables and files consequently allow multiple threads to be synchronized. Mutexes don't really synchronize threads by themselves.
For example:
thread_a and thread_b could be synchronized via a shared boolean variable such as true_a_false_b.
You'd have to access, test, and toggle that boolean variable every time you use it - a multistep process.
It's necessary to ensure that this multistep process occurs atomically, i.e. is not interrupted. This is when you would use a mutex. A trivialized example follows:
require 'thread'
Thread.abort_on_exception = true
true_a_false_b = true
mutex = Mutex.new
thread_a = Thread.new do
loop do
mutex.lock
if true_a_false_b
puts "a"
true_a_false_b = false
end
mutex.unlock
end
end
thread_b = Thread.new do
loop do
mutex.lock
if !true_a_false_b
puts "b"
true_a_false_b = true
end
mutex.unlock
end
sleep(1) # if in irb/console, yield the "current" thread to thread_a and thread_b
In your mutex example, the thread created in method b sleeps for a while, prints b then tries to unlock the mutex. This isn’t legal, a thread cannot unlock a mutex unless it already holds that lock, and raises an ThreadError if you try:
m = Mutex.new
m.unlock
results in:
release.rb:2:in `unlock': Attempt to unlock a mutex which is not locked (ThreadError)
from release.rb:2:in `<main>'
You won’t see this in your example because by default Ruby silently ignores exceptions raised in threads other than the main thread. You can change this using Thread::abort_on_exception= – if you add
Thread.abort_on_exception = true
to the top of your file you’ll see something like:
a
b
with-mutex.rb:15:in `unlock': Attempt to unlock a mutex which is not locked (ThreadError)
from with-mutex.rb:15:in `block in b'
(you might see more than one a, but there’ll only be one b).
In the a method you create threads that acquire a lock, print a, call another method (that creates a new thread and returns straight away) and then terminate. It doesn’t seem to be well documented but when a thread terminates it releases any locks it has, so in this case the lock is released almost immediately allowing other a threads to run.
Overall the lock doesn’t have much effect. It doesn’t prevent the b threads from running at all, and whilst it does prevent two a threads running at the same time, it is released as soon as the thread holding it exits.
I think you might be thinking of semaphores, and whilst the Ruby docs say “Mutex implements a simple semaphore” they are not quite the same.
Ruby doesn’t provide semaphores in the standard library, but it does provide condition variables. (That link goes to the older 2.0.0 docs. The thread standard library is required by default in Ruby 2.1+, and the move seems to have resulted in the current docs not being available. Also be aware that Ruby also has a separate monitor library which (I think) adds the same features (mutexes and condition variables) in a more object-orientated fashion.)
Using condition variables and mutexes you can control the coordination between threads. Uri Agassi’s answer shows one possible way to do that (although I think there’s a race condition with how his solution gets started).
If you look at the source for Queue (again this is a link to 2.0.0 – the thread library has been converted to C in recent versions and the Ruby version is easier to follow) you can see that it is implemented with Mutexes and ConditionVariables. When you call $queue.pop in the a thread in your queue example you end up calling wait on the mutex in the same way as Uri Agassi’s answer calls $cv.wait($mutex) in his method a. Similarly when you call $queue << true in your b thread you end up calling signal on the condition variable in the same way as Uri Agassi’s calls $cv.signal in his b thread.
The main reason your file locking example doesn’t work is that file locking provides a way for multiple processes to coordinate with each other (usually so only one tries to write to a file at the same time) and doesn’t help with coordinating threads within a process. Your file locking code is structured in a similar way to the mutex example so it’s likely it would suffer the same problems.
The problem with file-based version has not been sorted out properly.
The reason why it does not work is that f.flock(File::LOCK_EX) does not block if called on the same file f multiple times.
This can be checked with this simple sequential program:
require 'thread'
MUTEX_FILE_PATH = '/tmp/mutex'
$fone= File.new( MUTEX_FILE_PATH, "w")
$ftwo= File.open( MUTEX_FILE_PATH)
puts "start"
$fone.flock( File::LOCK_EX)
puts "locked"
$fone.flock( File::LOCK_EX)
puts "so what"
$ftwo.flock( File::LOCK_EX)
puts "dontcare"
which prints everything except dontcare.
So the file-based program does not work because
$mutex.flock(File::LOCK_EX)
never blocks.

Why concurrent loop is slower than normal loop in this scenario?

I am learning Threads in Ruby, from The Ruby Programming Language book & found this method which is described as concurrent version of each iterator,
module Enumerable
def concurrently
map {|item| Thread.new { yield item }}.each {|t| t.join }
end
end
The following code
start=Time.now
arr.concurrently{ |n| puts n} # Ran using threads
puts "Time Taken #{Time.now-start}"
outputs: Time Taken 6.6278332
While
start=Time.now
arr.each{ |n| puts n} # Normal each loop
puts "Time Taken #{Time.now-start}"
outputs: Time Taken 0.132975928
Why is it faster without threads ? Is the implementation wrong or the second one has only puts statement while the initial one took time for resource allocation/initialization/terminating the Threads ?
Threads in MRI (the "gold standard" ruby) are not really concurrent. There's a Global VM Lock (GVL) which prevents threads from running concurrently. It allows, however, other threads to run when the current thread is blocked on I/O, but that's not your case.
So, your code runs serially, and you have threading overhead (creating/destroying threads, etc). That's why it's slower.

Parallelism in Ruby

I've got a loop in my Ruby build script that iterates over each project and calls msbuild and does various other bits like minify CSS/JS.
Each loop iteration is independent of the others so I'd like to parallelise it.
How do I do this?
I've tried:
myarray.each{|item|
Thread.start {
# do stuff
}
}
puts "foo"
but Ruby just seems to exit straight away (prints "foo"). That is, it runs over the loop, starts a load of threads, but because there's nothing after the each, Ruby exits killing the other threads :(
I know I can do thread.join, but if I do this inside the loop then it's no longer parallel.
What am I missing?
I'm aware of http://peach.rubyforge.org/ but using that I get all kinds of weird behaviour that look like variable scoping issues that I don't know how to solve.
Edit
It would be useful if I could wait for all child-threads to execute before putting "foo", or at least the main ruby thread exiting. Is this possible?
Store all your threads in an array and loop through the array calling join:
threads = myarray.map do |item|
Thread.start do
# do stuff
end
end
threads.each { |thread| thread.join }
puts "foo"
Use em-synchrony here :). Fibers are cute.
require "em-synchrony"
require "em-synchrony/fiber_iterator"
# if you realy need to get a Fiber per each item
# in real life you could set concurrency to, for example, 10 and it could even improve performance
# it depends on amount of IO in your job
concurrency = myarray.size
EM.synchrony do
EM::Synchrony::FiberIterator.new(myarray, concurrency).each do |url|
# do some job here
end
EM.stop
end
Take into account that ruby threads are green threads, so you dont have natively true parallelism. I f this is what you want I would recommend you to take a look to JRuby and Rubinius:
http://www.engineyard.com/blog/2011/concurrency-in-jruby/

ruby thread block?

I read somewhere that ruby threads/fibre block the IO even with 1.9. Is this true and what does it truly mean? If I do some net/http stuff on multiple threads, is only 1 thread running at a given time for that request?
thanks
Assuming you are using CRuby, only one thread will be running at a time. However, the requests will be made in parallel, because each thread will be blocked on its IO while its IO is not finished. So if you do something like this:
require 'open-uri'
threads = 10.times.map do
Thread.new do
open('http://example.com').read.length
end
end
threads.map &:join
puts threads.map &:value
it will be much faster than doing it sequentially.
Also, you can check to see if a thread is finished w/o blocking on it's completion.
For example:
require 'open-uri'
thread = Thread.new do
sleep 10
open('http://example.com').read.length
end
puts 'still running' until thread.join(5)
puts thread.value
With CRuby, the threads cannot run at the same time, but they are still useful. Some of the other implementations, like JRuby, have real threads and can run multiple threads in parallel.
Some good references:
http://yehudakatz.com/2010/08/14/threads-in-ruby-enough-already/
http://www.engineyard.com/blog/2011/ruby-concurrency-and-you/
All threads run simultaneously but IO will be blocked until they all finish.
In other words, threading doesn't give you the ability to "background" a process. The interpreter will wait for all of the threads to complete before sending further messages.
This is good if you think about it because you don't have to wonder about whether they are complete if your next process uses data that the thread is modifying/working with.
If you want to background processes checkout delayed_job

What happens when you don't join your Threads?

I'm writing a ruby program that will be using threads to do some work. The work that is being done takes a non-deterministic amount of time to complete and can range anywhere from 5 to 45+ seconds. Below is a rough example of what the threading code looks like:
loop do # Program loop
items = get_items
threads = []
for item in items
threads << Thread.new(item) do |i|
# do work on i
end
threads.each { |t| t.join } # What happens if this isn't there?
end
end
My preference would be to skip joining the threads and not block the entire application. However I don't know what the long term implications of this are, especially because the code is run again almost immediately. Is this something that is safe to do? Or is there a better way to spawn a thread, have it do work, and clean up when it's finished, all within an infinite loop?
I think it really depends on the content of your thread work. If, for example, your main thread needed to print "X work done", you would need to join to guarantee that you were showing the correct answer. If you have no such requirement, then you wouldn't necessarily need to join up.
After writing the question out, I realized that this is the exact thing that a web server does when serving pages. I googled and found the following article of a Ruby web server. The loop code looks pretty much like mine:
loop do
session = server.accept
request = session.gets
# log stuff
Thread.start(session, request) do |session, request|
HttpServer.new(session, request, basePath).serve()
end
end
Thread.start is effectively the same as Thread.new, so it appears that letting the threads finish and die off is OK to do.
If you split up a workload to several different threads and you need to combine at the end the solutions from the different threads you definately need a join otherwise you could do it without a join..
If you removed the join, you could end up with new items getting started faster than the older ones get finished. If you're working on too many items at once, it may cause performance issues.
You should use a Queue instead (snippet from http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html):
require 'thread'
queue = Queue.new
producer = Thread.new do
5.times do |i|
sleep rand(i) # simulate expense
queue << i
puts "#{i} produced"
end
end
consumer = Thread.new do
5.times do |i|
value = queue.pop
sleep rand(i/2) # simulate expense
puts "consumed #{value}"
end
end
consumer.join

Resources