ruby thread block? - ruby

I read somewhere that ruby threads/fibre block the IO even with 1.9. Is this true and what does it truly mean? If I do some net/http stuff on multiple threads, is only 1 thread running at a given time for that request?
thanks

Assuming you are using CRuby, only one thread will be running at a time. However, the requests will be made in parallel, because each thread will be blocked on its IO while its IO is not finished. So if you do something like this:
require 'open-uri'
threads = 10.times.map do
Thread.new do
open('http://example.com').read.length
end
end
threads.map &:join
puts threads.map &:value
it will be much faster than doing it sequentially.
Also, you can check to see if a thread is finished w/o blocking on it's completion.
For example:
require 'open-uri'
thread = Thread.new do
sleep 10
open('http://example.com').read.length
end
puts 'still running' until thread.join(5)
puts thread.value
With CRuby, the threads cannot run at the same time, but they are still useful. Some of the other implementations, like JRuby, have real threads and can run multiple threads in parallel.
Some good references:
http://yehudakatz.com/2010/08/14/threads-in-ruby-enough-already/
http://www.engineyard.com/blog/2011/ruby-concurrency-and-you/

All threads run simultaneously but IO will be blocked until they all finish.
In other words, threading doesn't give you the ability to "background" a process. The interpreter will wait for all of the threads to complete before sending further messages.
This is good if you think about it because you don't have to wonder about whether they are complete if your next process uses data that the thread is modifying/working with.
If you want to background processes checkout delayed_job

Related

How does one achieve parallel tasks with Ruby's Fibers?

I'm new to fibers and EventMachine, and have only recently found out about fibers when I was seeing if Ruby had any concurrency features, like go-lang.
There don't seem to be a whole lot of examples out there for real use cases for when you'd use a Fiber.
I did manage to find this: https://www.igvita.com/2009/05/13/fibers-cooperative-scheduling-in-ruby/ (back from 2009!!!)
which has the following code:
require 'eventmachine'
require 'em-http'
require 'fiber'
def async_fetch(url)
f = Fiber.current
http = EventMachine::HttpRequest.new(url).get :timeout => 10
http.callback { f.resume(http) }
http.errback { f.resume(http) }
return Fiber.yield
end
EventMachine.run do
Fiber.new{
puts "Setting up HTTP request #1"
data = async_fetch('http://www.google.com/')
puts "Fetched page #1: #{data.response_header.status}"
EventMachine.stop
}.resume
end
And that's great, async GET request! yay!!! but... how do I actually use it asyncily? The example doesn't have anything beyond creating the containing Fiber.
From what I understand (and don't understand):
async_fetch is blocking until f.resume is called.
f is the current Fiber, which is the wrapping Fiber created in the EventMachine.run block.
the async_fetch yields control flow back to its caller? I'm not sure what this does
why does the wrapping fiber have resume at the end? are fibers paused by default?
Outside of the example, how do I use fibers to say, shoot off a bunch of requests triggered by keyboard commands?
like, for example: every time I type a letter, I make a request to google or something? - normally this requires a thread, which the main thread would tell the parallel thread to launch a thread for each request. :-\
I'm new to concurrency / Fibers. But they are so intriguing!
If anyone can answers these questions, that would be very appreciated!!!
There is a lot of confusion regarding fibers in Ruby. Fibers are not a tool with which to implement concurrency; they are merely a way of organizing code in a way that may more clearly represent what is going on.
That the name 'fibers' is similar to 'threads' in my opinion contributes to the confusion.
If you want true concurrency, that is, distributing the CPU load across all available CPU's, you have the following options:
In MRI Ruby
Running multiple Ruby VM's (i.e. OS processes), using fork, etc. Even with multiple threads in Ruby, the GIL (Global Interpreter Lock) prevents the use of more than 1 CPU by the Ruby runtime.
In JRuby
Unlike MRI Ruby, JRuby will use multiple CPU's when assigning threads, so you can get truly concurrent processing.
If your code is spending most of its time waiting for external resources, then you may not have any need for this true concurrency. MRI threads or some kind of event handling loop will probably work fine for you.

Celluloid async inside ruby blocks does not work

Trying to implement Celluloid async on my working example seem to exhibit weird behavior.
here my code looks
class Indefinite
include Celluloid
def run!
loop do
[1].each do |i|
async.on_background
end
end
end
def on_background
puts "Running in background"
end
end
Indefinite.new.run!
but when I run the above code, I never see the puts "Running in Background"
But, if I put a sleep the code seem to work.
class Indefinite
include Celluloid
def run!
loop do
[1].each do |i|
async.on_background
end
sleep 0.5
end
end
def on_background
puts "Running in background"
end
end
Indefinite.new.run!
Any idea? why such a difference in the above two scenario.
Thanks.
Your main loop is dominating the actor/application's threads.
All your program is doing is spawning background processes, but never running them. You need that sleep in the loop purely to allow the background threads to get attention.
It is not usually a good idea to have an unconditional loop spawn infinite background processes like you have here. There ought to be either a delay, or a conditional statement put in there... otherwise you just have an infinite loop spawning things that never get invoked.
Think about it like this: if you put puts "looping" just inside your loop, while you do not see Running in the background ... you will see looping over and over and over.
Approach #1: Use every or after blocks.
The best way to fix this is not to use sleep inside a loop, but to use an after or every block, like this:
every(0.1) {
on_background
}
Or best of all, if you want to make sure the process runs completely before running again, use after instead:
def run_method
#running ||= false
unless #running
#running = true
on_background
#running = false
end
after(0.1) { run_method }
end
Using a loop is not a good idea with async unless there is some kind of flow control done, or a blocking process such as with #server.accept... otherwise it will just pull 100% of the CPU core for no good reason.
By the way, you can also use now_and_every as well as now_and_after too... this would run the block right away, then run it again after the amount of time you want.
Using every is shown in this gist:
https://gist.github.com/digitalextremist/686f42e58a58b743142b
The ideal situation, in my opinion:
This is a rough but immediately usable example:
https://gist.github.com/digitalextremist/12fc824c6a4dbd94a9df
require 'celluloid/current'
class Indefinite
include Celluloid
INTERVAL = 0.5
ONE_AT_A_TIME = true
def self.run!
puts "000a Instantiating."
indefinite = new
indefinite.run
puts "000b Running forever:"
sleep
end
def initialize
puts "001a Initializing."
#mutex = Mutex.new if ONE_AT_A_TIME
#running = false
puts "001b Interval: #{INTERVAL}"
end
def run
puts "002a Running."
unless ONE_AT_A_TIME && #running
if ONE_AT_A_TIME
#mutex.synchronize {
puts "002b Inside lock."
#running = true
on_background
#running = false
}
else
puts "002b Without lock."
on_background
end
end
puts "002c Setting new timer."
after(INTERVAL) { run }
end
def on_background
if ONE_AT_A_TIME
puts "003 Running background processor in foreground."
else
puts "003 Running in background"
end
end
end
Indefinite.run!
puts "004 End of application."
This will be its output, if ONE_AT_A_TIME is true:
000a Instantiating.
001a Initializing.
001b Interval: 0.5
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
000b Running forever:
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
And this will be its output if ONE_AT_A_TIME is false:
000a Instantiating.
001a Initializing.
001b Interval: 0.5
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
000b Running forever:
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
You need to be more "evented" than "threaded" to properly issue tasks and preserve scope and state, rather than issue commands between threads/actors... which is what the every and after blocks provide. And besides that, it's good practice either way, even if you didn't have a Global Interpreter Lock to deal with, because in your example, it doesn't seem like you are dealing with a blocking process. If you had a blocking process, then by all means have an infinite loop. But since you're just going to end up spawning an infinite number of background tasks before even one is processed, you need to either use a sleep like your question started with, or use a different strategy altogether, and use every and after which is how Celluloid itself encourages you to operate when it comes to handling data on sockets of any kind.
Approach #2: Use a recursive method call.
This just came up in the Google Group. The below example code will actually allow execution of other tasks, even though it's an infinite loop.
https://groups.google.com/forum/#!topic/celluloid-ruby/xmkdrMQBGbY
This approach is less desirable because it will likely have more overhead, spawning a series of fibers.
def work
# ...
async.work
end
Question #2: Thread vs. Fiber behaviors.
The second question is why the following would work: loop { Thread.new { puts "Hello" } }
That spawns an infinite number of process threads, which are managed by the RVM directly. Even though there is a Global Interpreter Lock in the RVM you are using... that only means no green threads are used, which are provided by the operating system itself... instead these are handled by the process itself. The CPU scheduler for the process runs each Thread itself, without hesitation. And in the case of the example, the Thread runs very quickly and then dies.
Compared to an async task, a Fiber is used. So what's happening is this, in the default case:
Process starts.
Actor instantiated.
Method call invokes loop.
Loop invokes async method.
async method adds task to mailbox.
Mailbox is not invoked, and loop continues.
Another async task is added to the mailbox.
This continues infinitely.
The above is because the loop method itself is a Fiber call, which is not ever being suspended ( unless a sleep is called! ) and therefore the additional task added to the mailbox is never an invoking a new Fiber. A Fiber behaves differently than a Thread. This is a good piece of reference material discussing the differences:
https://blog.engineyard.com/2010/concurrency-real-and-imagined-in-mri-threads
Question #3: Celluloid vs. Celluloid::ZMQ behavior.
The third question is why include Celluloid behaves differently than Celluloid::ZMQ ...
That's because Celluloid::ZMQ uses a reactor-based evented mailbox, versus Celluloid which uses a condition variable based mailbox.
Read more about pipelining and execution modes:
https://github.com/celluloid/celluloid/wiki/Pipelining-and-execution-modes
That is the difference between the two examples. If you have additional questions about how these mailboxes behave, feel free to post on the Google Group ... the main dynamic you are facing is the unique nature of the GIL interacting with the Fiber vs. Thread vs. Reactor behavior.
You can read more about the reactor-pattern here:
http://en.wikipedia.org/wiki/Reactor_pattern
Explanation of the "Reactor pattern"
What is the difference between event driven model and reactor pattern?
And see the specific reactor used by Celluloid::ZMQ here:
https://github.com/celluloid/celluloid-zmq/blob/master/lib/celluloid/zmq/reactor.rb
So what's happening in the evented mailbox scenario, is that when sleep is hit, that is a blocking call, which causes the reactor to move to the next task in the mailbox.
But also, and this is unique to your situation, the specific reactor being used by Celluloid::ZMQ is using an eternal C library... specifically the 0MQ library. That reactor is external to your application, which behaves differently than Celluloid::IO or Celluloid itself, and that is also why the behavior is occurring differently than you expected.
Multi-core Support Alternative
If maintaining state and scope is not important to you, if you use jRuby or Rubinius which are not limited to one operating system thread, versus using MRI which has the Global Interpreter Lock, you can instantiate more than one actor and issue async calls between actors concurrently.
But my humble opinion is that you would be much better served using a very high frequency timer, such as 0.001 or 0.1 in my example, which will seem instantaneous for all intents and purposes, but also allow the actor thread plenty of time to switch fibers and run other tasks in the mailbox.
Let's make an experiment, by modifying your example a bit (we modify it because this way we get the same "weird" behaviour, while making things clearner):
class Indefinite
include Celluloid
def run!
(1..100).each do |i|
async.on_background i
end
puts "100 requests sent from #{Actor.current.object_id}"
end
def on_background(num)
(1..100000000).each {}
puts "message #{num} on #{Actor.current.object_id}"
end
end
Indefinite.new.run!
sleep
# =>
# 100 requests sent from 2084
# message 1 on 2084
# message 2 on 2084
# message 3 on 2084
# ...
You can run it on any Ruby interpreter, using Celluloid or Celluloid::ZMQ, the result always will be the same. Also note that, output from Actor.current.object_id is the same in both methods, giving us the clue, that we are dealing with a single actor in our experiment.
So there is not much difference between ruby and Celluloid implementations, as long as this experiment is concerned.
Let's first address why this code behaves in this way?
It's not hard to understand why it's happening. Celluloid is receiving incoming requests and saving them in the queue of tasks for appropriate actor. Note, that our original call to run! is on the top of the queue.
Celluloid then processes those tasks, one at a time. If there happens to be a blocking call or sleep call, according to the documentation, the next task will be invoked, not waiting for the current task to be completed.
Note, that in our experiment there are no blocking calls. It means, that the run! method will be executed from the beginning to the end, and only after it's done, each of the on_background calls will be invoked in the perfect order.
And it's how it's supposed to work.
If you add sleep call in your code, it will notify Celluloid, that it should start processing of the next task in queue. Thus, the behavior, you have in your second example.
Let's now continue to the part on how to design the system, so that it does not depend on sleep calls, which is weird at least.
Actually there is a good example at Celluloid-ZMQ project page. Note this loop:
def run
loop { async.handle_message #socket.read }
end
The first thing it does is #socket.read. Note that it's a blocking operation. So, Celluloid will process to the next message in the queue (if there are any). As soon as #socket.read responds, a new task will be generated. But this task won't be executed before #socket.read is called again, thus blocking execution, and notifying Celluloid to process with the next item on the queue.
You probably see the difference with your example. You are not blocking anything, thus not giving Celluloid a chance to process with queue.
How can we get behavior given in Celluloid::ZMQ example?
The first (in my opinion, better) solution is to have actual blocking call, like #socket.read.
If there are no blocking calls in your code and you still need to process things in background, then you should consider other mechanisms provided by Celluloid.
There are several options with Celluloid.
One can use conditions, futures, notifications, or just calling wait/signal on low level, like in this example:
class Indefinite
include Celluloid
def run!
loop do
async.on_background
result = wait(:background) #=> 33
end
end
def on_background
puts "background"
# notifies waiters, that they can continue
signal(:background, 33)
end
end
Indefinite.new.run!
sleep
# ...
# background
# background
# background
# ...
Using sleep(0) with Celluloid::ZMQ
I also noticed working.rb file you mentioned in your comment. It contains the following loop:
loop { [1].each { |i| async.handle_message 'hello' } ; sleep(0) }
It looks like it's doing the proper job. Actually, running it under jRuby revealed, it's leaking memory. To make it even more apparent, try to add a sleep call into the handle_message body:
def handle_message(message)
sleep 0.5
puts "got message: #{message}"
end
High memory usage is probably related to the fact, that queue is filled very fast and cannot be processed in given time. It will be more problematic, if handle_message is more work-intensive, then it's now.
Solutions with sleep
I'm skeptical about solutions with sleep. They potentially require much memory and even generate memory leaks. And it's not clear what should you pass as a parameter to the sleep method and why.
How threads work with Celluloid
Celluloid is not creating a new thread for each asynchronous task. It has a pool of threads in which it runs every task, synchronous and asynchronous ones. The key point is that the library sees the run! function as a synchronous task, and performs it in the same context than an asynchronous task.
By default, Celluloid runs everything in a single thread, using a queue system to schedule asynchronous tasks for later. It creates new threads only when needed.
Besides that, Celluloid overrides the sleep function. It means that every time you call sleep in a class extending the Celluloid class, the library will check if there are non-sleeping threads in its pool.
In your case, the first time you call sleep 0.5, it will create a new Thread to perform the asynchronous tasks in the queue while the first thread is sleeping.
So in your first example, only one Celluloid thread is running, performing the loop. In your second example, two Celluloid threads are running, the first one performing the loop and sleeping at each iteration, the other one performing the background task.
You could for instance change your first example to perform a finite number of iterations:
def run!
(0..100).each do
[1].each do |i|
async.on_background
end
end
puts "Done!"
end
When using this run! function, you'll see that Done! is printed before all the Running in background, meaning that Celluloid finishes the execution of the run! function before starting the asynchronous tasks in the same thread.

Any way to snipe or terminate specific sidekiq workers?

Is it possible to snipe or cancel specific Sidekiq workers/running jobs - effectively invoking an exception or something into the worker thread to terminate it.
I have some fairly simple background ruby (MRI 1.9.3) jobs under Sidekiq (latest) that run fine and are dependent on external systems. The external systems can take varying amounts of time during which the worker must remain available.
I think I can use Sidekiq's API to get to the appropriate worker - but I don't see any 'terminate/cancel/quite/exit' methods in the docs - is this possible? Is this something other people have done?
Ps. I know I could use an async loop within the workers job to trap relevant signals and shut itself down ..but that will complicate things a bit due to the nature of the external systems.
Async loop is the best way to do it as sidekiq has no way to terminate running job.
def perform
main_thread = Thread.new do
ActiveRecord::Base.connection_pool.with_connection do
begin
# ...
ensure
$redis.set some_thread_key, 1
end
end
end
watcher_thread = Thread.new do
ActiveRecord::Base.connection_pool.with_connection do
until $redis.del(some_thread_key) == 1 do
sleep 1
end
main_thread.kill
until !!main_thread.status == false do
sleep 0.1
end
end
end
[main_thread, watcher_thread].each(&:join)
end

Ruby Wait and Signal

I have two threads in a ruby process. What I want to do is have one sleep and the other send a signal to wake up.
I know how to do it with Mutex and ConditionalVariables, but I don't have a critic section to run so it's not the right solution.
I know how to do it with thread stop and thread run, where on thread stops itself and the other calls run on it, but It's now what I'm really looking for.
Is there other way? What I'm trying to accomplish is have on thread wait for content in the database and the other notify when there is content.
Maybe something like this will work?
require "thread"
q = Queue.new
Thread.new do
sleep 10
q.push "*"
end
p q.pop

Parallelism in Ruby

I've got a loop in my Ruby build script that iterates over each project and calls msbuild and does various other bits like minify CSS/JS.
Each loop iteration is independent of the others so I'd like to parallelise it.
How do I do this?
I've tried:
myarray.each{|item|
Thread.start {
# do stuff
}
}
puts "foo"
but Ruby just seems to exit straight away (prints "foo"). That is, it runs over the loop, starts a load of threads, but because there's nothing after the each, Ruby exits killing the other threads :(
I know I can do thread.join, but if I do this inside the loop then it's no longer parallel.
What am I missing?
I'm aware of http://peach.rubyforge.org/ but using that I get all kinds of weird behaviour that look like variable scoping issues that I don't know how to solve.
Edit
It would be useful if I could wait for all child-threads to execute before putting "foo", or at least the main ruby thread exiting. Is this possible?
Store all your threads in an array and loop through the array calling join:
threads = myarray.map do |item|
Thread.start do
# do stuff
end
end
threads.each { |thread| thread.join }
puts "foo"
Use em-synchrony here :). Fibers are cute.
require "em-synchrony"
require "em-synchrony/fiber_iterator"
# if you realy need to get a Fiber per each item
# in real life you could set concurrency to, for example, 10 and it could even improve performance
# it depends on amount of IO in your job
concurrency = myarray.size
EM.synchrony do
EM::Synchrony::FiberIterator.new(myarray, concurrency).each do |url|
# do some job here
end
EM.stop
end
Take into account that ruby threads are green threads, so you dont have natively true parallelism. I f this is what you want I would recommend you to take a look to JRuby and Rubinius:
http://www.engineyard.com/blog/2011/concurrency-in-jruby/

Resources