Parallelism in Ruby - ruby

I've got a loop in my Ruby build script that iterates over each project and calls msbuild and does various other bits like minify CSS/JS.
Each loop iteration is independent of the others so I'd like to parallelise it.
How do I do this?
I've tried:
myarray.each{|item|
Thread.start {
# do stuff
}
}
puts "foo"
but Ruby just seems to exit straight away (prints "foo"). That is, it runs over the loop, starts a load of threads, but because there's nothing after the each, Ruby exits killing the other threads :(
I know I can do thread.join, but if I do this inside the loop then it's no longer parallel.
What am I missing?
I'm aware of http://peach.rubyforge.org/ but using that I get all kinds of weird behaviour that look like variable scoping issues that I don't know how to solve.
Edit
It would be useful if I could wait for all child-threads to execute before putting "foo", or at least the main ruby thread exiting. Is this possible?

Store all your threads in an array and loop through the array calling join:
threads = myarray.map do |item|
Thread.start do
# do stuff
end
end
threads.each { |thread| thread.join }
puts "foo"

Use em-synchrony here :). Fibers are cute.
require "em-synchrony"
require "em-synchrony/fiber_iterator"
# if you realy need to get a Fiber per each item
# in real life you could set concurrency to, for example, 10 and it could even improve performance
# it depends on amount of IO in your job
concurrency = myarray.size
EM.synchrony do
EM::Synchrony::FiberIterator.new(myarray, concurrency).each do |url|
# do some job here
end
EM.stop
end

Take into account that ruby threads are green threads, so you dont have natively true parallelism. I f this is what you want I would recommend you to take a look to JRuby and Rubinius:
http://www.engineyard.com/blog/2011/concurrency-in-jruby/

Related

Why concurrent loop is slower than normal loop in this scenario?

I am learning Threads in Ruby, from The Ruby Programming Language book & found this method which is described as concurrent version of each iterator,
module Enumerable
def concurrently
map {|item| Thread.new { yield item }}.each {|t| t.join }
end
end
The following code
start=Time.now
arr.concurrently{ |n| puts n} # Ran using threads
puts "Time Taken #{Time.now-start}"
outputs: Time Taken 6.6278332
While
start=Time.now
arr.each{ |n| puts n} # Normal each loop
puts "Time Taken #{Time.now-start}"
outputs: Time Taken 0.132975928
Why is it faster without threads ? Is the implementation wrong or the second one has only puts statement while the initial one took time for resource allocation/initialization/terminating the Threads ?
Threads in MRI (the "gold standard" ruby) are not really concurrent. There's a Global VM Lock (GVL) which prevents threads from running concurrently. It allows, however, other threads to run when the current thread is blocked on I/O, but that's not your case.
So, your code runs serially, and you have threading overhead (creating/destroying threads, etc). That's why it's slower.

Building an asynchronous queue in Ruby

I need to process jobs off of a queue within a process, with IO performed asynchronously. That's pretty straightforward. The gotcha is that those jobs can add additional items to the queue.
I think I've been fiddling with this problem too long so my brain is cloudy — it shouldn't be too difficult. I keep coming up with an either-or scenario:
The queue can perform jobs asynchronously and results can be joined in afterward.
The queue can synchronously perform jobs until the last finishes and the queue is empty.
I've been fiddling with everything from EventMachine and Goliath (both of which can use EM::HttpRequest) to Celluloid (never actually got around to building something with it though), and writing Enumerators using Fibers. My brain is fried though.
What I'd like, simply, is to be able to do this:
items = [1,2,3]
items.each do |item|
if item.has_particular_condition?
items << item.process_one_way
elsif item.other_condition?
items << item.process_another_way
# ...
end
end
#=> [1,2,3,4,5,6]
...where 4, 5, and 6 were all results of processing the original items in the set, and 7, 8, and 9 are results from processing 4, 5, and 6. I don't need to worry about indefinitely processing the queue because the data I'm processing will end after a couple of iterations.
High-level guidance, comments, links to other libraries, etc are all welcome, as well as lower-level implementation code examples.
I have had similar requirements in the past and what you need is a solid, high performance work queue from the sounds of it. I recommend you check out beanstalkd which I discovered over a year ago and have since been using to process thousands and thousands of jobs reliably in ruby.
In particular, I have started developing solid ruby libraries around beanstalkd. In particular, be sure to check out backburner which is a production ready work queue in ruby using beanstalkd. The syntax and setup are easy, defining how jobs process is quick, handling job failures and retries is all built in as are job scheduling and a lot more.
Let me know if you have any questions but I think beanstalkd and backburner would fit your requirements quite well.
I wound up implementing something a little less ideal — basically just wrapping an EM Fiber Iterator in a loop that terminates once no new results are queued.
require 'set'
class SetRunner
def initialize(seed_queue)
#results = seed_queue.to_set
end
def run
begin
yield last_loop_results, result_bucket
end until new_loop_results.empty?
return #results
end
def last_loop_results
result_bucket.shift(result_bucket.count)
end
def result_bucket
#result_bucket ||= #results.to_a
end
def new_loop_results
# .add? returns nil if already in the set
result_bucket.each { |item| #results.add? item }.compact
end
end
Then, to use it with EventMachine:
queue = [1,2,3]
results = SetRunner.new(queue).run do |set, output|
EM::Synchrony::FiberIterator.new(set, 3).each do |item|
output.push(item + 3) if item <= 6
end
end
# => [1,2,3,4,5,6,7,8,9]
Then each set will get run with the concurrency level passed to the FiberIterator, but the results from each set will get run in the next iteration of the outer SetRunner loop.

ruby thread block?

I read somewhere that ruby threads/fibre block the IO even with 1.9. Is this true and what does it truly mean? If I do some net/http stuff on multiple threads, is only 1 thread running at a given time for that request?
thanks
Assuming you are using CRuby, only one thread will be running at a time. However, the requests will be made in parallel, because each thread will be blocked on its IO while its IO is not finished. So if you do something like this:
require 'open-uri'
threads = 10.times.map do
Thread.new do
open('http://example.com').read.length
end
end
threads.map &:join
puts threads.map &:value
it will be much faster than doing it sequentially.
Also, you can check to see if a thread is finished w/o blocking on it's completion.
For example:
require 'open-uri'
thread = Thread.new do
sleep 10
open('http://example.com').read.length
end
puts 'still running' until thread.join(5)
puts thread.value
With CRuby, the threads cannot run at the same time, but they are still useful. Some of the other implementations, like JRuby, have real threads and can run multiple threads in parallel.
Some good references:
http://yehudakatz.com/2010/08/14/threads-in-ruby-enough-already/
http://www.engineyard.com/blog/2011/ruby-concurrency-and-you/
All threads run simultaneously but IO will be blocked until they all finish.
In other words, threading doesn't give you the ability to "background" a process. The interpreter will wait for all of the threads to complete before sending further messages.
This is good if you think about it because you don't have to wonder about whether they are complete if your next process uses data that the thread is modifying/working with.
If you want to background processes checkout delayed_job

Improving ruby code performance with threading not working

I am trying to optimize this piece of ruby code with Thread, which involves a fair bit of IO and network activity. Unfortunately that is not going down too well.
# each host is somewhere in the local network
hosts.each { |host|
# reach out to every host with something for it to do
# wait for host to complete the work and get back
}
My original plan was to wrap the internal of the loop into a new Thread for each iteration. Something like:
# each host is somewhere in the local network
hosts.each { |host|
Thread.new {
# reach out to every host with something for it to do
# wait for host to complete the work and get back
}
}
# join all threads here before main ends
I was hoping that since this I/O bound even without ruby 1.9 I should be able to gain something but nope, nothing. Any ideas how this might be improved?
I'm not sure how many hosts you have, but if it's a lot you may want to try a producer/consumer model with a fixed number of threads:
require 'thread'
THREADS_COUNT = (ARGV[0] || 4).to_i
$q = Queue.new
hosts.each { |h| $q << h }
threads = (1..THREADS_COUNT).map {
Thread.new {
begin
loop {
host = $q.shift(true)
# do something with host
}
rescue ThreadError
# queue is empty, pass
end
}
}
threads.each(&:join)
If this all too complicated, you may try using xargs -P. ;)
#Fanatic23, before you draw any conclusions, instrument your code with puts and see whether the network requests are actually overlapping. In each call to puts, print a status string indicating the line which is executing, along with Time.now and Time.now.nsec.
You say "since this [is] I/O bound even without ruby 1.9 I should be able to gain something".
Vain hope. :-)
When a thread in Ruby 1.8 blocks on IO, the entire process has blocked on IO. This is because it's using green theads.
Upgrade to Ruby 1.9, and you'll have access to your platform's native threads implementation. For more, see:
http://en.wikipedia.org/wiki/Green_threads
Enjoy!

What happens when you don't join your Threads?

I'm writing a ruby program that will be using threads to do some work. The work that is being done takes a non-deterministic amount of time to complete and can range anywhere from 5 to 45+ seconds. Below is a rough example of what the threading code looks like:
loop do # Program loop
items = get_items
threads = []
for item in items
threads << Thread.new(item) do |i|
# do work on i
end
threads.each { |t| t.join } # What happens if this isn't there?
end
end
My preference would be to skip joining the threads and not block the entire application. However I don't know what the long term implications of this are, especially because the code is run again almost immediately. Is this something that is safe to do? Or is there a better way to spawn a thread, have it do work, and clean up when it's finished, all within an infinite loop?
I think it really depends on the content of your thread work. If, for example, your main thread needed to print "X work done", you would need to join to guarantee that you were showing the correct answer. If you have no such requirement, then you wouldn't necessarily need to join up.
After writing the question out, I realized that this is the exact thing that a web server does when serving pages. I googled and found the following article of a Ruby web server. The loop code looks pretty much like mine:
loop do
session = server.accept
request = session.gets
# log stuff
Thread.start(session, request) do |session, request|
HttpServer.new(session, request, basePath).serve()
end
end
Thread.start is effectively the same as Thread.new, so it appears that letting the threads finish and die off is OK to do.
If you split up a workload to several different threads and you need to combine at the end the solutions from the different threads you definately need a join otherwise you could do it without a join..
If you removed the join, you could end up with new items getting started faster than the older ones get finished. If you're working on too many items at once, it may cause performance issues.
You should use a Queue instead (snippet from http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html):
require 'thread'
queue = Queue.new
producer = Thread.new do
5.times do |i|
sleep rand(i) # simulate expense
queue << i
puts "#{i} produced"
end
end
consumer = Thread.new do
5.times do |i|
value = queue.pop
sleep rand(i/2) # simulate expense
puts "consumed #{value}"
end
end
consumer.join

Resources