Understanding Celluloid Concurrency - ruby

Following is my Celluloid code.
client1.rb One of the two clients. (I named it client 1.)
client2.rb The second of the two clients. (Named client 2.)
Note:
The only difference between the two clients is the text that is passed to the server, i.e. 'client-1' and 'client-2' respectively.
On testing these two clients (by running them side by side) against the following two servers (one at a time), I found very strange results.
server1.rb (a basic example taken from the README.md of the celluloid-zmq)
Using this as the server for the two clients above resulted in parallel execution of tasks.
OUTPUT
ruby server1.rb
Received at 04:59:39 PM and message is client-1
Going to sleep now
Received at 04:59:52 PM and message is client-2
Note:
The client2.rb message was processed while the client1.rb request was sleeping (a mark of parallelism).
server2.rb
Using this as the server for the two clients above did not result in parallel execution of tasks.
OUTPUT
ruby server2.rb
Received at 04:55:52 PM and message is client-1
Going to sleep now
Received at 04:56:52 PM and message is client-2
Note:
client-2 was made to wait 60 seconds because client-1 was sleeping (a 60-second sleep).
I ran the above test multiple times; all runs showed the same behaviour.
Can anyone explain, from the results of the above tests:
Question: Why does Celluloid wait for 60 seconds before it can process the other request, as seen in the server2.rb case?
Ruby version
ruby -v
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0]

Using your gists, I verified this issue can be reproduced in MRI 2.2.1 as well as jRuby 1.7.21 and Rubinius 2.5.8. The difference between server1.rb and server2.rb is the use of the DisplayMessage class and its message class method in the latter.
Use of sleep in DisplayMessage is out of Celluloid scope.
When sleep is used in server1.rb it is actually Celluloid.sleep, but when used in server2.rb it is Kernel.sleep, which locks up the mailbox for Server until 60 seconds have passed. This prevents further method calls on that actor from being processed until the mailbox is processing messages (method calls on the actor) again.
There are three ways to resolve this:
Use a defer {} or future {} block.
Explicitly invoke Celluloid.sleep rather than sleep ( if not explicitly invoked as Celluloid.sleep, using sleep will end up calling Kernel.sleep since DisplayMessage does not include Celluloid like Server does )
Bring the contents of DisplayMessage.message into handle_message as in server1.rb; or at least into Server, which is in Celluloid scope, and will use the correct sleep.
The defer {} approach:
def handle_message(message)
  defer {
    DisplayMessage.message(message)
  }
end
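For completeness, a future-based variant could look roughly like this (a sketch only; Celluloid::Future.new runs its block on a background thread, and #value fetches the result later if you need it):
def handle_message(message)
  # The block starts running on another thread right away.
  pending = Celluloid::Future.new { DisplayMessage.message(message) }
  # ... the actor is free to handle other messages here ...
  # pending.value   # would block only this task until the result is ready
end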
The Celluloid.sleep approach:
class DisplayMessage
  def self.message(message)
    #de ...
    Celluloid.sleep 60
  end
end
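And a sketch of the third option, folding the work back into the actor so the bare sleep resolves to Celluloid.sleep (this mirrors the server1.rb behaviour described above; the body shown here is assumed):
class Server
  include Celluloid::ZMQ

  def handle_message(message)
    puts "Received at #{Time.now.strftime('%I:%M:%S %p')} and message is #{message}"
    # Inside a Celluloid actor, a bare sleep is Celluloid.sleep, so only this task suspends.
    sleep 60 if message == 'client-1'
  end
end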
Not truly a scope issue; it's about asynchrony.
To reiterate, the deeper issue is not the scope of sleep ... that's why defer and future are my best recommendation. But to post something here that came out in my comments:
Using defer or future pushes a task that would cause an actor to become tied up into another thread. If you use future, you can get the return value once the task is done; if you use defer, you can fire and forget.
But better yet, create another actor for tasks that tend to get tied up, and even pool that other actor... if defer or future don't work for you.
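For example, a dedicated pooled worker could look roughly like this (MessageWorker is a hypothetical name, and the pool size is purely illustrative):
class MessageWorker
  include Celluloid

  def display(message)
    DisplayMessage.message(message)   # the slow, sleeping work lives here
  end
end

# In the server:
# @workers = MessageWorker.pool(size: 4)
# @workers.async.display(message)   # fire and forget into the pool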
I'd be more than happy to answer follow-up questions brought up by this question; we have a very active mailing list, and IRC channel. Your generous bounties are commendable, but plenty of us would help purely to help you.

Managed to reproduce and fix the issue.
Deleting my previous answer.
Apparently, the problem lies in sleep.
Confirmed by adding "actor/kernel sleeping" logs to a local copy of Celluloid's sleep().
In server1.rb,
the call to sleep is within Server - a class that includes Celluloid.
Thus Celluloid's implementation of sleep overrides the native sleep.
class Server
  include Celluloid::ZMQ
  ...
  def run
    loop { async.handle_message @socket.read }
  end

  def handle_message(message)
    ...
    sleep 60
  end
end
Note the "actor sleeping" log from server1.rb (the log added to Celluloid's sleep()).
This suspends only the current "actor" in Celluloid,
i.e. only the "Celluloid thread" handling client1 sleeps.
In server2.rb,
the call to sleep is within a different class, DisplayMessage, that does NOT include Celluloid.
Thus it is the native sleep itself.
class DisplayMessage
  def self.message(message)
    ...
    sleep 60
  end
end
Note the ABSENCE of any "actor sleeping" log from server2.rb.
This suspends the current Ruby thread, i.e. the whole Ruby server sleeps (not just a single Celluloid actor).
The Fix?
In server2.rb, the appropriate sleep must be explicitly specified.
class DisplayMessage
  def self.message(message)
    puts "Received at #{Time.now.strftime('%I:%M:%S %p')} and message is #{message}"

    ## Sleep intentionally added to test whether Celluloid blocks the main process for 60 seconds or not.
    if message == 'client-1'
      puts 'Going to sleep now'.red
      # A bare "sleep 60" would invoke the native sleep.
      # Use Celluloid.sleep to support concurrent execution.
      Celluloid.sleep 60
    end
  end
end

Related

Why do Ruby fibers that run sequentially without a scheduler set run concurrently when a scheduler is set?

I have the following Gemfile:
source "https://rubygems.org"
ruby "3.1.2"
gem "libev_scheduler", "~> 0.2"
and the following Ruby code in a file called main.rb:
require 'libev_scheduler'

set_sched = ARGV[0] == "--set-sched"

if set_sched then
  Fiber.set_scheduler Libev::Scheduler.new
end

N_FIBERS = 5

fibers = []

N_FIBERS.times do |i|
  n = i + 1
  fiber = Fiber.new do
    puts "Beginning calculation ##{n}..."
    sleep 1
  end
  fibers.push({fiber: fiber, n: n})
end

fibers.each do |fiber|
  fiber[:fiber].resume
end

puts "Finished all calculations!"
I'm executing the code with Ruby 3.1.2 installed via RVM.
When I run the program with time bundle exec ruby main.rb, I get the following output:
Beginning calculation #1...
Beginning calculation #2...
Beginning calculation #3...
Beginning calculation #4...
Beginning calculation #5...
Finished all calculations!
real 0m5.179s
user 0m0.146s
sys 0m0.027s
When I run the program with time bundle exec ruby main.rb --set-sched, I get the following output:
Beginning calculation #1...
Beginning calculation #2...
Beginning calculation #3...
Beginning calculation #4...
Beginning calculation #5...
Finished all calculations!
real 0m1.173s
user 0m0.150s
sys 0m0.021s
Why do my fibers only run concurrently when I've set a scheduler? Some older Stack Overflow answers (like this one) state that fibers are a construct for flow control, not concurrency, and that it is impossible to use fibers to write concurrent code. My results seem to contradict this.
My understanding so far of fibers is that they are meant for cooperative concurrency, as opposed to preemptive concurrency. Therefore, in order to get concurrency out of them, you'd need to have them yield to some other code as early as they can (ex. when IO begins) so that the other code can be executed while the fiber waits for its next opportunity to execute.
Based on this understanding, I think I understand why my code without a scheduler isn't able to run concurrently. Each fiber sleeps, and because it lacks yield statements before and after the code in it, there is no point in time where it could yield control to any other code I've written. But when I add a scheduler, it appears to somehow yield to something. Is sleep detecting the scheduler and yielding to it, so that my code resuming the fibers is immediately yielded to, making it able to immediately resume all five fibers?
Great question!
As @stefan noted above, Ruby 3.0 introduced the concept of a "non-blocking fiber." The way the actual non-blocking behavior is accomplished is left up to the scheduler implementation. There is no default scheduler as far as I know; per the Ruby docs:
If Fiber.scheduler is not set in the current thread, blocking and non-blocking fibers’ behavior is identical.
Now, to answer your last question:
But when I add a scheduler, it appears to somehow yield to something ... Is sleep detecting the scheduler and yielding to it so that my code resuming the fibers is immediately yielded to, making it able to immediately resume all five fibers?
You're onto something! When you set a fiber scheduler, it's expected to conform to Fiber::SchedulerInterface, which defines several "hooks." One of those hooks is #kernel_sleep, which is invoked by Kernel#sleep (and Mutex#sleep)!
I can't say I've read much libev code, but you can find libev_scheduler's implementation of that hook here.
The idea is (emphasis my own):
The scheduler runs into a wait loop, checking all the blocked fibers (which it has registered on hook calls) and resuming them when the awaited resource is ready (e.g. I/O ready or sleep time elapsed).
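As an aside, if you want to watch this hook fire in the example above, one untested sketch is to subclass the scheduler and log from #kernel_sleep before delegating to the original implementation (this assumes Libev::Scheduler exposes kernel_sleep as an ordinary overridable Ruby method):
require 'libev_scheduler'

# Hypothetical subclass, purely for observing the hook.
class LoggingScheduler < Libev::Scheduler
  def kernel_sleep(duration = nil)
    puts "kernel_sleep called for #{duration.inspect}s"
    super
  end
end

Fiber.set_scheduler LoggingScheduler.new
Fiber.new { sleep 1 }.resume   # prints the log line, then the scheduler handles the wait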
So, in summary:
Your fiber calls Kernel#sleep with some duration.
Kernel#sleep calls the scheduler's #kernel_sleep hook with that same duration.
The scheduler "somehow registers what the current fiber is waiting on, and yields control to other fibers with Fiber.yield" (quote from the docs there).
"The scheduler runs into a wait loop, checking all the blocked fibers (which it has registered on hook calls) and resuming them when the awaited resource is ready (e.g. I/O ready or sleep time elapsed)."
Hope this helps!

Understanding Celluloid Pool

I guess my understanding of Celluloid Pool is somewhat broken. I will try to explain below, but before that, a quick note.
Note: Our system is running against a very fast client passing messages over ZeroMQ.
With the following Vanilla Celluloid app
class VanillaClient
  include Celluloid::ZMQ

  def read
    loop { async.evaluate_response(socket.read_multipart) }
  end

  def evaluate_response(data)
    ## the reason for using defer can be found over here.
    Celluloid.defer do
      ExternalService.execute(data)
    end
  end
end
Our system fails after some time with the error 'Can't spawn more threads' (or something like it).
So we intended to use a Celluloid Pool (to avoid the above-mentioned problem) so that we can limit the number of threads that get spawned.
My understanding of Celluloid Pool is:
Celluloid Pool maintains a pool of actors for you so that you can distribute your tasks in parallel.
Hence, I decided to test it, but according to my test cases it seems to behave serially (i.e. things never get distributed or happen in parallel).
Example to replicate this.
sender-1.rb
## Send message `1` to the the_client.rb
sender-2.rb
## Send message `2` to the the_client.rb
the_client.rb
## take message from sender-1 and sender-2 and return it back to receiver.rb
## heads up, the `sleep` is introduced to test/replicate the IO block that happens in the actual code.
receiver.rb
## print the message obtained from the_client.rb
If sender-2.rb is run before sender-1.rb, it appears that the pool gets blocked for 20 seconds (the sleep time in the_client.rb, which can be seen over here) before consuming the data sent by sender-1.rb.
It behaves the same under ruby-2.2.2 and jRuby-9.0.5.0. What could be the possible causes for the pool to act in such a manner?
Your pool call is not asynchronous.
Execution of evaluate on @pool still needs to be .async, as in your original (non-pooled) example. You still want asynchronous behavior, but you also want multiple handler actors.
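Concretely, the difference looks roughly like this (MessageHandler and the pool size are placeholders for whatever your actual pooled actor is):
# Assuming the pool was created somewhere like this:
# @pool = MessageHandler.pool(size: 5)

@pool.evaluate(data)        # synchronous: the caller blocks until evaluate returns
@pool.async.evaluate(data)  # asynchronous: hands the task to a free actor in the pool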
Next you will likely hit the Pool.async bug.
https://github.com/celluloid/celluloid-pool/issues/6
This means after 5 hits to evaluate your pool will become unresponsive until at least one actor in the pool is finished. Worst case scenario, if you get 6+ requests in rapid succession, the 6th will then take 120 seconds, because it will take 5*20 seconds before it executes, then 20 seconds to execute itself.
Depending on what your actual operation is that's causing you delays -- you might need to adjust your pool size down the line.

Celluloid async inside ruby blocks does not work

Trying to implement Celluloid async on my working example seems to exhibit weird behavior.
Here is what my code looks like:
class Indefinite
  include Celluloid

  def run!
    loop do
      [1].each do |i|
        async.on_background
      end
    end
  end

  def on_background
    puts "Running in background"
  end
end

Indefinite.new.run!
But when I run the above code, I never see "Running in background" printed.
However, if I put in a sleep, the code seems to work:
class Indefinite
  include Celluloid

  def run!
    loop do
      [1].each do |i|
        async.on_background
      end
      sleep 0.5
    end
  end

  def on_background
    puts "Running in background"
  end
end

Indefinite.new.run!
Any idea why there is such a difference between the above two scenarios?
Thanks.
Your main loop is dominating the actor/application's threads.
All your program is doing is spawning background processes, but never running them. You need that sleep in the loop purely to allow the background threads to get attention.
It is not usually a good idea to have an unconditional loop spawn infinite background processes like you have here. There ought to be either a delay, or a conditional statement put in there... otherwise you just have an infinite loop spawning things that never get invoked.
Think about it like this: if you put puts "looping" just inside your loop, while you do not see Running in the background ... you will see looping over and over and over.
Approach #1: Use every or after blocks.
The best way to fix this is not to use sleep inside a loop, but to use an after or every block, like this:
every(0.1) {
  on_background
}
Or best of all, if you want to make sure the process runs completely before running again, use after instead:
def run_method
  @running ||= false
  unless @running
    @running = true
    on_background
    @running = false
  end
  after(0.1) { run_method }
end
Using a loop is not a good idea with async unless there is some kind of flow control done, or a blocking process such as with @server.accept... otherwise it will just pull 100% of the CPU core for no good reason.
By the way, you can also use now_and_every as well as now_and_after too... this would run the block right away, then run it again after the amount of time you want.
Using every is shown in this gist:
https://gist.github.com/digitalextremist/686f42e58a58b743142b
The ideal situation, in my opinion:
This is a rough but immediately usable example:
https://gist.github.com/digitalextremist/12fc824c6a4dbd94a9df
require 'celluloid/current'

class Indefinite
  include Celluloid

  INTERVAL = 0.5
  ONE_AT_A_TIME = true

  def self.run!
    puts "000a Instantiating."
    indefinite = new
    indefinite.run
    puts "000b Running forever:"
    sleep
  end

  def initialize
    puts "001a Initializing."
    @mutex = Mutex.new if ONE_AT_A_TIME
    @running = false
    puts "001b Interval: #{INTERVAL}"
  end

  def run
    puts "002a Running."
    unless ONE_AT_A_TIME && @running
      if ONE_AT_A_TIME
        @mutex.synchronize {
          puts "002b Inside lock."
          @running = true
          on_background
          @running = false
        }
      else
        puts "002b Without lock."
        on_background
      end
    end
    puts "002c Setting new timer."
    after(INTERVAL) { run }
  end

  def on_background
    if ONE_AT_A_TIME
      puts "003 Running background processor in foreground."
    else
      puts "003 Running in background"
    end
  end
end

Indefinite.run!
puts "004 End of application."
This will be its output, if ONE_AT_A_TIME is true:
000a Instantiating.
001a Initializing.
001b Interval: 0.5
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
000b Running forever:
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
002a Running.
002b Inside lock.
003 Running background processor in foreground.
002c Setting new timer.
And this will be its output if ONE_AT_A_TIME is false:
000a Instantiating.
001a Initializing.
001b Interval: 0.5
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
000b Running forever:
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
002a Running.
002b Without lock.
003 Running in background
002c Setting new timer.
You need to be more "evented" than "threaded" to properly issue tasks and preserve scope and state, rather than issue commands between threads/actors... which is what the every and after blocks provide.

And besides that, it's good practice either way, even if you didn't have a Global Interpreter Lock to deal with, because in your example it doesn't seem like you are dealing with a blocking process. If you had a blocking process, then by all means have an infinite loop. But since you're just going to end up spawning an infinite number of background tasks before even one is processed, you need to either use a sleep like your question started with, or use a different strategy altogether and use every and after, which is how Celluloid itself encourages you to operate when it comes to handling data on sockets of any kind.
Approach #2: Use a recursive method call.
This just came up in the Google Group. The below example code will actually allow execution of other tasks, even though it's an infinite loop.
https://groups.google.com/forum/#!topic/celluloid-ruby/xmkdrMQBGbY
This approach is less desirable because it will likely have more overhead, spawning a series of fibers.
def work
  # ...
  async.work
end
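In context, a minimal (hypothetical) actor using this recursive pattern might look like this; each call re-enqueues itself as a fresh mailbox task, so other messages get a chance to run in between:
class Worker
  include Celluloid

  def work
    # ... handle one unit of work here ...
    async.work   # schedule the next iteration as a new mailbox task
  end
end

Worker.new.async.work
sleep   # keep the main thread alive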
Question #2: Thread vs. Fiber behaviors.
The second question is why the following would work: loop { Thread.new { puts "Hello" } }
That spawns an infinite number of process threads, which are managed by the Ruby VM directly. Even though the Ruby VM you are using has a Global Interpreter Lock, that only means one thread can execute Ruby code at a time; the threads themselves are native operating-system threads rather than green threads handled inside the process. The CPU scheduler runs each Thread itself, without hesitation. And in the case of the example, each Thread runs very quickly and then dies.
An async task, by comparison, uses a Fiber. So what's happening in the default case is this:
Process starts.
Actor instantiated.
Method call invokes loop.
Loop invokes async method.
async method adds task to mailbox.
Mailbox is not invoked, and loop continues.
Another async task is added to the mailbox.
This continues infinitely.
The above happens because the loop method itself runs in a Fiber, which is never suspended (unless a sleep is called!), and therefore the additional tasks added to the mailbox never get to invoke a new Fiber. A Fiber behaves differently than a Thread. This is a good piece of reference material discussing the differences:
https://blog.engineyard.com/2010/concurrency-real-and-imagined-in-mri-threads
Question #3: Celluloid vs. Celluloid::ZMQ behavior.
The third question is why include Celluloid behaves differently than Celluloid::ZMQ ...
That's because Celluloid::ZMQ uses a reactor-based evented mailbox, versus Celluloid which uses a condition variable based mailbox.
Read more about pipelining and execution modes:
https://github.com/celluloid/celluloid/wiki/Pipelining-and-execution-modes
That is the difference between the two examples. If you have additional questions about how these mailboxes behave, feel free to post on the Google Group ... the main dynamic you are facing is the unique nature of the GIL interacting with the Fiber vs. Thread vs. Reactor behavior.
You can read more about the reactor-pattern here:
http://en.wikipedia.org/wiki/Reactor_pattern
Explanation of the "Reactor pattern"
What is the difference between event driven model and reactor pattern?
And see the specific reactor used by Celluloid::ZMQ here:
https://github.com/celluloid/celluloid-zmq/blob/master/lib/celluloid/zmq/reactor.rb
So what's happening in the evented mailbox scenario is that when sleep is hit, that is a blocking call, which causes the reactor to move on to the next task in the mailbox.
But also, and this is unique to your situation, the specific reactor being used by Celluloid::ZMQ is using an external C library... specifically the 0MQ library. That reactor is external to your application, which behaves differently than Celluloid::IO or Celluloid itself, and that is also why the behavior is occurring differently than you expected.
Multi-core Support Alternative
If maintaining state and scope is not important to you, and you use jRuby or Rubinius (which are not limited to one operating-system thread, unlike MRI with its Global Interpreter Lock), you can instantiate more than one actor and issue async calls between actors concurrently.
But my humble opinion is that you would be much better served using a very high frequency timer, such as 0.001 or 0.1 in my example, which will seem instantaneous for all intents and purposes, but also allow the actor thread plenty of time to switch fibers and run other tasks in the mailbox.
Let's make an experiment by modifying your example a bit (we modify it because this way we get the same "weird" behaviour while making things clearer):
class Indefinite
  include Celluloid

  def run!
    (1..100).each do |i|
      async.on_background i
    end
    puts "100 requests sent from #{Actor.current.object_id}"
  end

  def on_background(num)
    (1..100000000).each {}
    puts "message #{num} on #{Actor.current.object_id}"
  end
end

Indefinite.new.run!
sleep
# =>
# 100 requests sent from 2084
# message 1 on 2084
# message 2 on 2084
# message 3 on 2084
# ...
You can run it on any Ruby interpreter, using Celluloid or Celluloid::ZMQ; the result will always be the same. Also note that the output from Actor.current.object_id is the same in both methods, giving us the clue that we are dealing with a single actor in our experiment.
So there is not much difference between plain Ruby and the Celluloid implementations, as far as this experiment is concerned.
Let's first address why this code behaves this way.
It's not hard to understand why this happens. Celluloid receives incoming requests and saves them in the task queue for the appropriate actor. Note that our original call to run! is at the top of the queue.
Celluloid then processes those tasks one at a time. If there happens to be a blocking call or a sleep call, then, according to the documentation, the next task will be invoked without waiting for the current task to complete.
Note that in our experiment there are no blocking calls. This means that the run! method will be executed from beginning to end, and only after it's done will each of the on_background calls be invoked, in perfect order.
And that's how it's supposed to work.
If you add a sleep call to your code, it notifies Celluloid that it should start processing the next task in the queue. Hence the behavior you see in your second example.
Let's now continue to how to design the system so that it does not depend on sleep calls, which is weird to say the least.
Actually there is a good example on the Celluloid::ZMQ project page. Note this loop:
def run
  loop { async.handle_message @socket.read }
end
The first thing it does is @socket.read. Note that it's a blocking operation, so Celluloid will proceed to the next message in the queue (if there is one). As soon as @socket.read responds, a new task is generated. But this task won't be executed before @socket.read is called again, thus blocking execution and notifying Celluloid to proceed with the next item in the queue.
You probably see the difference with your example. You are not blocking anything, thus not giving Celluloid a chance to process the queue.
How can we get the behavior shown in the Celluloid::ZMQ example?
The first (in my opinion, better) solution is to have an actual blocking call, like @socket.read.
If there are no blocking calls in your code and you still need to process things in the background, then you should consider other mechanisms provided by Celluloid.
There are several options with Celluloid.
One can use conditions, futures, notifications, or just call wait/signal at a low level, as in this example:
class Indefinite
  include Celluloid

  def run!
    loop do
      async.on_background
      result = wait(:background) #=> 33
    end
  end

  def on_background
    puts "background"
    # notifies waiters, that they can continue
    signal(:background, 33)
  end
end

Indefinite.new.run!
sleep
# ...
# background
# background
# background
# ...
Using sleep(0) with Celluloid::ZMQ
I also noticed the working.rb file you mentioned in your comment. It contains the following loop:
loop { [1].each { |i| async.handle_message 'hello' } ; sleep(0) }
It looks like it's doing the proper job. However, running it under jRuby revealed that it's leaking memory. To make it even more apparent, try adding a sleep call to the handle_message body:
def handle_message(message)
  sleep 0.5
  puts "got message: #{message}"
end
The high memory usage is probably related to the fact that the queue is filled very quickly and cannot be processed in the given time. It will be even more problematic if handle_message is more work-intensive than it is now.
Solutions with sleep
I'm skeptical about solutions involving sleep. They potentially require a lot of memory and can even generate memory leaks. And it's not clear what you should pass as a parameter to the sleep method, or why.
How threads work with Celluloid
Celluloid does not create a new thread for each asynchronous task. It has a pool of threads in which it runs every task, synchronous and asynchronous ones. The key point is that the library sees the run! function as a synchronous task, and performs it in the same context as an asynchronous task.
By default, Celluloid runs everything in a single thread, using a queue system to schedule asynchronous tasks for later. It creates new threads only when needed.
Besides that, Celluloid overrides the sleep function. This means that every time you call sleep in a class that includes Celluloid, the library will check whether there are non-sleeping threads in its pool.
In your case, the first time you call sleep 0.5, it will create a new Thread to perform the asynchronous tasks in the queue while the first thread is sleeping.
So in your first example, only one Celluloid thread is running, performing the loop. In your second example, two Celluloid threads are running, the first one performing the loop and sleeping at each iteration, the other one performing the background task.
You could for instance change your first example to perform a finite number of iterations:
def run!
  (0..100).each do
    [1].each do |i|
      async.on_background
    end
  end
  puts "Done!"
end
When using this run! function, you'll see that Done! is printed before all of the Running in background lines, meaning that Celluloid finishes executing the run! function before starting the asynchronous tasks in the same thread.

Send multiple messages over a websocket using threads

I'm making a Ruby server using the em-websocket gem. When a client sends some message (e.g. "thread") the server creates two different threads and sends two answers to the client in parallel (I'm actually studying multithreading and websockets). Here's my code:
EM.run {
  EM::WebSocket.run(:host => "0.0.0.0", :port => 8080) do |ws|
    ws.onmessage { |msg|
      puts "Received message: #{msg}"
      if msg == "thread"
        threads = []
        threads << a = Thread.new {
          sleep(1)
          puts "1"
          ws.send("Message sent from thread 1")
        }
        threads << b = Thread.new {
          sleep(2)
          puts "2"
          ws.send("Message sent from thread 2")
        }
        threads.each { |aThread| aThread.join }
      end
    }
  end
}
How it executes:
I send the "thread" message to the server.
After one second I see the string "1" printed in my console. After another second I see "2".
Only after that are both messages sent to the client, simultaneously.
The problem is that I want each message to be sent at the same moment its debug output ("1" or "2") is printed.
My Ruby version is 1.9.3p194.
I don't have experience with EM, so take this with a pinch of salt.
However, at first glance, it looks like "aThread.join" is actually blocking the "onmessage" method from completing and thus also preventing the "ws.send" from being processed.
Have you tried removing the "threads.each" block?
Edit:
After having tested this in arch linux with both ruby 1.9.3 and 2.0.0 (using "test.html" from the examples of em-websocket), I am sure that even if removing the "threads.each" block doesn't fix the problem for you, you will still have to remove it as Thread#join will suspend the current thread until the "joined" threads are finished.
If you follow the function call of "ws.onmessage" through the source code, you will end up at the Connection#send_data method of the Eventmachine module and find the following within the comments:
Call this method to send data to the remote end of the network connection. It takes a single String argument, which may contain binary data. Data is buffered to be sent at the end of this event loop tick (cycle).
As "onmessage" is blocked by the "join" until both "send" methods have run, the event loop tick cannot finish until both sets of data are buffered and thus, all the data cannot be sent until this time.
If it is still not working for you after removing the "threads.each" block, make sure that you have restarted your eventmachine and try setting the second sleep to 5 seconds instead. I don't know how long a typical event loop takes in eventmachine (and I can't imagine it to be as long as a second), however, the documentation basically says that if several "send" calls are made within the same tick, they will all be sent at the same time. So increasing the time difference will make sure that this is not happening.
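For reference, here is a sketch of the handler with the joins removed; to stay on the safe side of EventMachine's threading rules I have also pushed the sends back onto the reactor with EM.next_tick (that part is my own addition, not something the original code did):
ws.onmessage { |msg|
  puts "Received message: #{msg}"
  if msg == "thread"
    Thread.new {
      sleep(1)
      puts "1"
      EM.next_tick { ws.send("Message sent from thread 1") }   # schedule the send on the reactor thread
    }
    Thread.new {
      sleep(2)
      puts "2"
      EM.next_tick { ws.send("Message sent from thread 2") }
    }
  end
}
Because onmessage now returns immediately, each send goes out on its own reactor tick, right after its "1" or "2" is printed.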
I think the problem is that you are calling the sleep method, passing 1 to the first thread and 2 to the second thread.
Try removing the sleep call from both threads, or passing the same value in each call.

Recovering cleanly from Resque::TermException or SIGTERM on Heroku

When we restart or deploy we get a number of Resque jobs in the failed queue with either Resque::TermException (SIGTERM) or Resque::DirtyExit.
We're using the new TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 in our Procfile so our worker line looks like:
worker: TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 bundle exec rake environment resque:work QUEUE=critical,high,low
We're also using resque-retry, which I thought might auto-retry on these two exceptions, but it seems not to be doing so.
So I guess two questions:
We could manually rescue from Resque::TermException in each job, and use this to reschedule the job. But is there a clean way to do this for all jobs? Even a monkey patch.
Shouldn't resque-retry auto retry these? Can you think of any reason why it wouldn't be?
Thanks!
Edit: Getting all jobs to complete in less than 10 seconds seems unreasonable at scale. It seems like there needs to be a way to automatically re-queue these jobs when the Resque::DirtyExit exception is raised.
I ran into this issue as well. It turns out that Heroku sends the SIGTERM signal not just to the parent process but to all forked processes. This is not the logic that Resque expects, which causes the RESQUE_PRE_SHUTDOWN_TIMEOUT to be skipped, forcing jobs to be terminated without any time to attempt to finish.
Heroku gives workers 30s to gracefully shutdown after a SIGTERM is issued. In most cases, this is plenty of time to finish a job with some buffer time left over to requeue the job to Resque if the job couldn't finish. However, for all of this time to be used you need to set the RESQUE_PRE_SHUTDOWN_TIMEOUT and RESQUE_TERM_TIMEOUT env vars as well as patch Resque to correctly respond to SIGTERM being sent to forked processes.
Here's a gem which patches resque and explains this issue in more detail:
https://github.com/iloveitaly/resque-heroku-signals
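With that gem in place, the worker line also needs the pre-shutdown timeout; here is a sketch with illustrative values, chosen so both timeouts fit comfortably inside Heroku's 30-second window:
worker: TERM_CHILD=1 RESQUE_PRE_SHUTDOWN_TIMEOUT=20 RESQUE_TERM_TIMEOUT=8 bundle exec rake environment resque:work QUEUE=critical,high,low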
This will be a two part answer, first addressing Resque::TermException and then Resque::DirtyExit.
TermException
It's worth noting that if you are using ActiveJob with Rails 7 or later the retry_on and discard_on methods can be used to handle Resque::TermException. You could write the following in your job class:
retry_on(::Resque::TermException, wait: 2.minutes, attempts: 4)
or
discard_on(::Resque::TermException)
A big caveat here is that if you are using a Rails version prior to 7 you'll need to add some custom code to get this to work.
The reason is that Resque::TermException does not inherit from StandardError (it inherits from SignalException, source: https://github.com/resque/resque/blob/master/lib/resque/errors.rb#L26) and prior to Rails 7 retry_on and discard_on only handle exceptions that inherit from StandardError.
Here's the Rails 7 commit that changes this to work with all exception subclasses: https://github.com/rails/rails/commit/142ae54e54ac81a0f62eaa43c3c280307cf2127a
So if you want to use retry_on to handle Resque::TermException on a Rails version earlier than 7 you have a few options:
Monkey patch TermException so that it inherits from StandardError.
Add a rescue statement to your perform method that explicitly looks for Resque::TermException or one of its ancestors (eg SignalException, Exception).
Patch the implementation of perform_now with the Rails 7 version (this is what I did in my codebase).
Here's how you can retry on a TermException by adding a rescue to your job's perform method:
class MyJob < ActiveJob::Base
  # ActiveJob's `retry_on` and `discard_on` methods don't handle `TermException`
  # because it inherits from `SignalException` rather than `StandardError`.
  module RetryOnTermination
    def perform(*args, **kwargs)
      super
    rescue Resque::TermException
      Rails.logger.info("Retrying #{self.class.name} due to Resque::TermException")
      self.class.set(wait: 2.minutes).perform_later(*args, **kwargs)
    end
  end

  # prepend must come after the module definition so the constant is resolvable.
  prepend RetryOnTermination

  def perform(*args, **kwargs)
    # ... the job's actual work goes here ...
  end
end
Alternatively you can use the Rails 7 definition of perform_now by adding this to your job class:
# FIXME: Here we override the Rails 6 implementation of this method with the
# Rails 7 implementation in order to be able to retry/discard exceptions that
# don't inherit from StandardError, such as `Resque::TermException`.
#
# When we upgrade to Rails 7 we should remove this.
# Latest stable Rails (7 as of this writing) source: https://github.com/rails/rails/blob/main/activejob/lib/active_job/execution.rb
# Rails 6.1 source: https://github.com/rails/rails/blob/6-1-stable/activejob/lib/active_job/execution.rb
# Rails 6.0 source (same code as 6.1): https://github.com/rails/rails/blob/6-0-stable/activejob/lib/active_job/execution.rb
#
# NOTE: I've made a minor change to the Rails 7 implementation, I've removed
# the line `ActiveSupport::ExecutionContext[:job] = self`, because `ExecutionContext`
# isn't defined prior to Rails 7.
def perform_now
  # Guard against jobs that were persisted before we started counting executions by zeroing out nil counters
  self.executions = (executions || 0) + 1

  deserialize_arguments_if_needed

  run_callbacks :perform do
    perform(*arguments)
  end
rescue Exception => exception
  rescue_with_handler(exception) || raise
end
DirtyExit
Resque::DirtyExit is raised in the parent process, rather than the forked child process that actually executes your job code. This means that any code you have in your job for rescuing or retrying those exceptions won't work. See these lines of code where that happens:
https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L940
https://github.com/resque/resque/blob/master/lib/resque/job.rb#L234
https://github.com/resque/resque/blob/master/lib/resque/job.rb#L285
But fortunately, Resque provides a mechanism for dealing with this, job hooks, specifically the on_failure hook: https://github.com/resque/resque/blob/master/docs/HOOKS.md#job-hooks
A quote from those docs:
on_failure: Called with the exception and job args if any exception occurs while performing the job (or hooks), this includes Resque::DirtyExit.
And an example from those docs on how to use hooks to retry exceptions:
module RetriedJob
  def on_failure_retry(e, *args)
    Logger.info "Performing #{self} caused an exception (#{e}). Retrying..."
    Resque.enqueue self, *args
  end
end

class MyJob
  extend RetriedJob
end
We could manually rescue from Resque::TermException in each job, and use this to reschedule the job. But is there a clean way to do
this for all jobs? Even a monkey patch.
The Resque::DirtyExit exception is raised when the job is killed with the SIGTERM signal. The job does not have the opportunity to catch the exception as you can read here.
Shouldn't resque-retry auto retry these? Can you think of any reason why it wouldn't be?
I don't see why it shouldn't. Is the scheduler running? If not, run rake resque:scheduler.
I wrote a detailed blog post around some of the problems I had recently with Resque::DirtyExit, maybe it is useful => Understanding the Resque internals – Resque::DirtyExit unveiled
I've also struggled with this for a while without finding a reliable solution.
One of the few solutions I've found is running a rake task on a schedule (cron job every 1 minute) which looks for jobs failing with Resque::DirtyExit, retries these specific jobs and removes these jobs from the failure queue.
Here's a sample of the rake task
https://gist.github.com/CharlesP/1818418754aec03403b3
This solution is clearly suboptimal but to date it's the best solution I've found to retry these jobs.
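I can't vouch for the exact contents of that gist, but a rough, untested sketch of such a task using Resque's failure backend API might look like this:
namespace :resque do
  desc "Requeue and clear jobs that failed with Resque::DirtyExit"
  task :retry_dirty_exits => :environment do
    # Walk the failure list backwards so that removing entries does not shift
    # the indexes of failures we have not inspected yet.
    (Resque::Failure.count - 1).downto(0) do |i|
      failure = Resque::Failure.all(i, 1)
      next unless failure && failure['exception'] == 'Resque::DirtyExit'
      Resque::Failure.requeue(i)
      Resque::Failure.remove(i)
    end
  end
end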
Are your resque jobs taking longer than 10 seconds to complete? If the jobs complete within 10 seconds after the initial SIGTERM is sent you should be fine. Try to break up the jobs into smaller chunks that finish quicker.
Also, you can have your worker re-enqueue the job doing something like this: https://gist.github.com/mrrooijen/3719427
