What happens when you don't join your Threads? - ruby

I'm writing a ruby program that will be using threads to do some work. The work that is being done takes a non-deterministic amount of time to complete and can range anywhere from 5 to 45+ seconds. Below is a rough example of what the threading code looks like:
loop do # Program loop
  items = get_items
  threads = []

  for item in items
    threads << Thread.new(item) do |i|
      # do work on i
    end
  end

  threads.each { |t| t.join } # What happens if this isn't there?
end
My preference would be to skip joining the threads and not block the entire application. However I don't know what the long term implications of this are, especially because the code is run again almost immediately. Is this something that is safe to do? Or is there a better way to spawn a thread, have it do work, and clean up when it's finished, all within an infinite loop?

I think it really depends on the content of your thread work. If, for example, your main thread needed to print "X work done", you would need to join to guarantee that you were showing the correct answer. If you have no such requirement, then you wouldn't necessarily need to join up.
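A minimal sketch of that situation (the squared numbers are just stand-in work, and a Queue is used so the threads can record results safely):

require 'thread'

results = Queue.new

threads = 10.times.map do |i|
  Thread.new { results << i * i } # stand-in for real work
end

threads.each { |t| t.join } # without this, the count below could be anything from 0 to 10
puts "#{results.size} work done"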

After writing the question out, I realized that this is exactly what a web server does when serving pages. I googled and found an article about a simple Ruby web server; its loop looks pretty much like mine:
loop do
  session = server.accept
  request = session.gets
  # log stuff
  Thread.start(session, request) do |session, request|
    HttpServer.new(session, request, basePath).serve()
  end
end
Thread.start is effectively the same as Thread.new, so it appears that letting the threads finish and die off is OK to do.
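One caveat, sketched below: an unjoined thread only runs to completion if the process stays alive long enough, and any exception it raises is silently discarded unless you join it (or enable Thread.abort_on_exception). In a long-running loop like yours, or the web server's, that is usually acceptable.

# An unjoined thread still finishes, as long as the process outlives it:
Thread.new { sleep 1; puts "finished" }
sleep 2 # "finished" is printed

# But threads still alive when the main thread exits are killed:
Thread.new { sleep 1; puts "never printed" }
exit # the process ends before the thread wakes up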

If you split a workload across several threads and need to combine their results at the end, you definitely need a join; otherwise you can do without one.

If you removed the join, you could end up with new items getting started faster than the older ones get finished. If you're working on too many items at once, it may cause performance issues.
You should use a Queue instead (snippet from http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html):
require 'thread'

queue = Queue.new

producer = Thread.new do
  5.times do |i|
    sleep rand(i) # simulate expense
    queue << i
    puts "#{i} produced"
  end
end

consumer = Thread.new do
  5.times do |i|
    value = queue.pop
    sleep rand(i/2) # simulate expense
    puts "consumed #{value}"
  end
end
consumer.join
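Applied to the loop in the question, the same idea caps how much work is in flight at any one time. A rough sketch, reusing get_items from the question; the worker count and the done? exit condition are made up for illustration:

require 'thread'

WORKER_COUNT = 4
queue = Queue.new

workers = WORKER_COUNT.times.map do
  Thread.new do
    while (item = queue.pop) # nil is used as a stop signal
      # do work on item
    end
  end
end

loop do
  get_items.each { |item| queue << item }
  break if done? # hypothetical exit condition; the original loop runs forever
end

WORKER_COUNT.times { queue << nil } # tell each worker to exit
workers.each { |t| t.join }

Swapping Queue for SizedQueue would also bound the backlog, so the producing loop blocks whenever the workers fall behind.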

Related

How to pass a block to a yielding thread in Ruby

I am trying to wrap my head around Threads and Yielding in Ruby, and I have a question about how to pass a block to a yielding thread.
Specifically, I have a thread that is sleeping, and waiting to be told to do something, and I would like that Thread to execute a different block if told to (ie, it is sleeping, and if a user presses a button, do something besides sleep).
Say I have code like this:
window = Thread.new do
  @thread1 = Thread.new do
    # Do some cool stuff
    # Decide it is time to sleep
    until @told_to_wakeup
      if block_given?
        yield
      end
      sleep(1)
    end
  end

  # At some point after @thread1 starts sleeping,
  # a user might do something, so I want to execute
  # some code in @thread1 (unfortunately spawning a new thread
  # won't work correctly in my case)
end
Is it possible to do that?
I tried using @thread1.send(), but send was looking for a method name.
Thanks for taking the time to look at this!
Here's a simple worker thread:
require 'thread'

queue = Queue.new

worker = Thread.new do
  # Fetch an item from the work queue, or wait until one is available
  while (work = queue.pop)
    # ... Do something with work
  end
end

queue.push(thing: 'to do')
The pop method will block until something is pushed into the queue.
When you're done you can push in a deliberately empty job:
queue.push(nil)
That will make the worker thread exit.
You can always expand on that functionality to do more things, or to handle more conditions.
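For the original question, the items pushed into the queue can themselves be blocks (lambdas), which sidesteps trying to send a block into a thread that is already sleeping. A sketch along those lines, using only the stdlib Queue:

require 'thread'

jobs = Queue.new

worker = Thread.new do
  while (job = jobs.pop) # blocks until a job arrives; nil means stop
    job.call
  end
end

# Elsewhere, e.g. in response to a button press:
jobs.push(-> { puts "woken up to do something else" })

jobs.push(nil) # tell the worker to exit
worker.join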

Why concurrent loop is slower than normal loop in this scenario?

I am learning about threads in Ruby from The Ruby Programming Language book and found this method, which is described as a concurrent version of the each iterator:
module Enumerable
  def concurrently
    map { |item| Thread.new { yield item } }.each { |t| t.join }
  end
end
The following code
start = Time.now
arr.concurrently { |n| puts n } # Ran using threads
puts "Time Taken #{Time.now - start}"
outputs: Time Taken 6.6278332
While
start = Time.now
arr.each { |n| puts n } # Normal each loop
puts "Time Taken #{Time.now - start}"
outputs: Time Taken 0.132975928
Why is it faster without threads? Is the implementation wrong, or is it that the second version only runs puts while the first also pays for creating, scheduling and joining the threads?
Threads in MRI (the "gold standard" Ruby) are not truly concurrent. There's a Global VM Lock (GVL) which prevents threads from running Ruby code in parallel. It does allow other threads to run while the current thread is blocked on I/O, but that's not your case.
So your code runs serially, and on top of that you pay the threading overhead (creating threads, scheduling them, joining them). That's why it's slower.
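A quick way to see the GVL's effect is to compare CPU-bound work with work that blocks (sleep stands in for blocking I/O here, during which MRI releases the lock). A rough sketch; the exact numbers will vary by machine:

require 'benchmark'

def cpu_work
  200_000.times { Math.sqrt(rand) }
end

# CPU-bound: threads don't help under MRI because of the GVL
puts Benchmark.realtime { 4.times { cpu_work } }
puts Benchmark.realtime { 4.times.map { Thread.new { cpu_work } }.each { |t| t.join } }

# Blocking waits: threads overlap them, so this drops from roughly 2s to roughly 0.5s
puts Benchmark.realtime { 4.times { sleep 0.5 } }
puts Benchmark.realtime { 4.times.map { Thread.new { sleep 0.5 } }.each { |t| t.join } }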

Building an asynchronous queue in Ruby

I need to process jobs off of a queue within a process, with IO performed asynchronously. That's pretty straightforward. The gotcha is that those jobs can add additional items to the queue.
I think I've been fiddling with this problem too long so my brain is cloudy — it shouldn't be too difficult. I keep coming up with an either-or scenario:
The queue can perform jobs asynchronously and results can be joined in afterward.
The queue can synchronously perform jobs until the last finishes and the queue is empty.
I've been fiddling with everything from EventMachine and Goliath (both of which can use EM::HttpRequest) to Celluloid (never actually got around to building something with it though), and writing Enumerators using Fibers. My brain is fried though.
What I'd like, simply, is to be able to do this:
items = [1,2,3]
items.each do |item|
  if item.has_particular_condition?
    items << item.process_one_way
  elsif item.other_condition?
    items << item.process_another_way
  # ...
  end
end
#=> [1,2,3,4,5,6,7,8,9]
...where 4, 5, and 6 were all results of processing the original items in the set, and 7, 8, and 9 are results from processing 4, 5, and 6. I don't need to worry about indefinitely processing the queue because the data I'm processing will end after a couple of iterations.
High-level guidance, comments, links to other libraries, etc are all welcome, as well as lower-level implementation code examples.
I have had similar requirements in the past, and from the sounds of it what you need is a solid, high-performance work queue. I recommend checking out beanstalkd, which I discovered over a year ago and have since used to reliably process many thousands of jobs in Ruby.
I have also been developing solid Ruby libraries around beanstalkd. In particular, check out backburner, a production-ready work queue for Ruby built on beanstalkd. The syntax and setup are easy, defining how jobs are processed is quick, and handling of job failures, retries, job scheduling and a lot more is built in.
Let me know if you have any questions but I think beanstalkd and backburner would fit your requirements quite well.
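A rough sketch of what a backburner job looks like, going by its documented usage; the class, queue name and arguments below are made up for illustration, so check the backburner README for the exact API:

require 'backburner'

Backburner.configure do |config|
  config.beanstalk_url = "beanstalk://127.0.0.1"
end

class ProcessItemJob
  include Backburner::Queue
  queue "items" # beanstalkd tube this job is pushed to

  def self.perform(item_id)
    # process the item; backburner handles retries on failure
  end
end

# Enqueue work from anywhere in this process (or another process entirely)
Backburner.enqueue ProcessItemJob, 42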
I wound up implementing something a little less ideal — basically just wrapping an EM Fiber Iterator in a loop that terminates once no new results are queued.
require 'set'

class SetRunner
  def initialize(seed_queue)
    @results = seed_queue.to_set
  end

  def run
    begin
      yield last_loop_results, result_bucket
    end until new_loop_results.empty?
    return @results
  end

  def last_loop_results
    result_bucket.shift(result_bucket.count)
  end

  def result_bucket
    @result_bucket ||= @results.to_a
  end

  def new_loop_results
    # .add? returns nil if already in the set
    result_bucket.each { |item| @results.add? item }.compact
  end
end
Then, to use it with EventMachine:
queue = [1,2,3]
results = SetRunner.new(queue).run do |set, output|
  EM::Synchrony::FiberIterator.new(set, 3).each do |item|
    output.push(item + 3) if item <= 6
  end
end
# => [1,2,3,4,5,6,7,8,9]
Then each set will get run with the concurrency level passed to the FiberIterator, but the results from each set will get run in the next iteration of the outer SetRunner loop.
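For comparison, a plain-stdlib sketch of the same drain-until-empty idea without EventMachine: a counter of outstanding items tells the workers when the queue has truly drained. The +3 / item <= 6 rule mirrors the example above; the variable names and worker count are made up:

require 'thread'
require 'set'

queue   = Queue.new
results = Set.new
mutex   = Mutex.new
pending = 3 # items queued or in progress

[1, 2, 3].each { |n| queue << n }

workers = 3.times.map do
  Thread.new do
    while (item = queue.pop) # nil tells a worker to stop
      mutex.synchronize { results.add(item) }
      if item <= 6 # processing may enqueue follow-up work
        mutex.synchronize { pending += 1 }
        queue << item + 3
      end
      mutex.synchronize do
        pending -= 1
        3.times { queue << nil } if pending.zero? # drained: stop every worker
      end
    end
  end
end

workers.each { |t| t.join }
p results.sort # => [1, 2, 3, 4, 5, 6, 7, 8, 9]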

Parallelism in Ruby

I've got a loop in my Ruby build script that iterates over each project and calls msbuild and does various other bits like minify CSS/JS.
Each loop iteration is independent of the others so I'd like to parallelise it.
How do I do this?
I've tried:
myarray.each{|item|
  Thread.start {
    # do stuff
  }
}
puts "foo"
but Ruby just seems to exit straight away (prints "foo"). That is, it runs over the loop, starts a load of threads, but because there's nothing after the each, Ruby exits killing the other threads :(
I know I can do thread.join, but if I do this inside the loop then it's no longer parallel.
What am I missing?
I'm aware of http://peach.rubyforge.org/ but using that I get all kinds of weird behaviour that look like variable scoping issues that I don't know how to solve.
Edit
It would be useful if I could wait for all child-threads to execute before putting "foo", or at least the main ruby thread exiting. Is this possible?
Store all your threads in an array and loop through the array calling join:
threads = myarray.map do |item|
  Thread.start do
    # do stuff
  end
end
threads.each { |thread| thread.join }
puts "foo"
Use em-synchrony here :). Fibers are cute.
require "em-synchrony"
require "em-synchrony/fiber_iterator"
# if you realy need to get a Fiber per each item
# in real life you could set concurrency to, for example, 10 and it could even improve performance
# it depends on amount of IO in your job
concurrency = myarray.size
EM.synchrony do
EM::Synchrony::FiberIterator.new(myarray, concurrency).each do |url|
# do some job here
end
EM.stop
end
Take into account that MRI's threads cannot run Ruby code in parallel (green threads in 1.8, a global lock in 1.9), so you don't natively get true parallelism. If that is what you want, I would recommend taking a look at JRuby and Rubinius:
http://www.engineyard.com/blog/2011/concurrency-in-jruby/

Deadlock in ruby code using SizedQueue

I think I'm running up against a fundamental misunderstanding on my part of how threading works in ruby and I'm hoping to get some insight.
I'd like to have a simple producer and consumer. First, a producer thread that pulls lines from a file and sticks them into a SizedQueue; when those run out, stick some tokens on the end to let the consumer(s) know things are done.
require 'thread'

numthreads = 2
filename = 'edition-2009-09-11.txt'

bq = SizedQueue.new(4)

producerthread = Thread.new(bq) do |queue|
  File.open(filename) do |f|
    f.each do |r|
      queue << r
    end
  end
  numthreads.times do
    queue << :end_of_producer
  end
end
Now a few consumers. For simplicity, let's have them do nothing.
consumerthreads = []

numthreads.times do
  consumerthreads << Thread.new(bq) do |queue|
    until (line = queue.pop) === :end_of_producer
      # do stuff in here
    end
  end
end

producerthread.join
consumerthreads.each { |t| t.join }

puts "All done"
My understanding is that (a) the producer thread will block once the SizedQueue is full and eventually get back to filling it up, and (b) the consumer threads will pull from the SizedQueue, blocking when it empties, and eventually finish.
But under ruby 1.9 (ruby 1.9.1p243 (2009-07-16 revision 24175) [i386-darwin9]) I get a deadlock error on the joins. What's going on here? I just don't see where there's any interaction between the threads except via the SizedQueue, which is supposed to be thread-safe.
Any insight would be much-appreciated.
Your understanding is correct, and your code works on my machine on a slightly newer version of Ruby (ruby 1.9.2dev (2009-08-30 trunk 24705) [i386-darwin10.0.0]).
