How can I tell unicorn to understand Heroku's signals? - ruby

Perhaps you've seen this...
2012-03-07T15:36:25+00:00 heroku[web.1]: Stopping process with SIGTERM
2012-03-07T15:36:36+00:00 heroku[web.1]: Stopping process with SIGKILL
2012-03-07T15:36:36+00:00 heroku[web.1]: Error R12 (Exit timeout) -> Process failed to exit within 10 seconds of SIGTERM
2012-03-07T15:36:38+00:00 heroku[web.1]: Process exited with status 137
This is a well-known problem when running Unicorn on Heroku:
Heroku uses SIGTERM for graceful shutdown
Unicorn uses SIGTERM for quick shutdown
Can I tell Heroku to send SIGQUIT instead? Or can I tell Unicorn to treat SIGTERM as a graceful shutdown?

Heroku now provides instructions for this here:
https://blog.heroku.com/archives/2013/2/27/unicorn_rails
Their suggested unicorn.rb file is:
# config/unicorn.rb
worker_processes 3
timeout 30
preload_app true
before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end

  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end

  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.establish_connection
end

This is a hack, but I've successfully created a Unicorn config file that traps the TERM signal, preventing Unicorn from receiving it and performing its quick shutdown. My signal handler then sends a QUIT signal back to itself to trigger Unicorn's graceful shutdown.
Tested with Ruby 1.9.2, Unicorn 4.0.1 and 4.2.1, Mac OS X.
listen 9292
worker_processes 1
# This is a hack. The code is run with 'before_fork' so it runs
# *after* Unicorn installs its own TERM signal handler (which makes
# this highly dependent on the Unicorn implementation details).
#
# We install our own signal handler for TERM and simply re-send a QUIT
# signal to ourselves.
before_fork do |_server, _worker|
  Signal.trap 'TERM' do
    puts 'intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end
end
One concern is that (I believe) this signal handler is inherited by worker processes. But the worker process installs its own TERM handler, which should overwrite this one, so I would not expect any issue. (See Unicorn::HttpServer#init_worker_process in lib/unicorn/http_server.rb:551.)
Edit: one more detail. This block that installs the signal handler will run once per worker process (because it is in before_fork), but that is merely redundant and won't affect anything.
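If you want to sanity-check this locally before pushing to Heroku, one way (assuming a Rack app with a config.ru in the current directory) is to boot Unicorn against this config and send the master a TERM yourself:
unicorn -c unicorn.rb config.ru
# from another terminal, using the master PID printed at startup:
kill -TERM <master pid>
You should see the "intercepting TERM" message followed by Unicorn's usual graceful (QUIT-style) shutdown of the workers.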

Related

[Ruby 1.9][Windows] Sending Ctrl-C interrupt signal to a spawned subprocess

I have a main script in Ruby 1.9.3 running on Windows. It will start another Ruby script that runs as a daemon, do its own stuff, and then end the daemon by sending an "INT" signal. The main script and daemon don't otherwise exchange any data.
The daemon itself can run as a standalone, and we terminate it with Ctrl-C. Here's the part that prepares it for the signal:
def setup_ctrl_c_to_quit
  Thread.new do
    trap("INT") do
      puts "got INT signal"
      exit
    end
    while true
      sleep 1
    end
  end
end
I am currently having trouble getting the main script to launch and terminate the daemon. Currently, I can start the daemon through spawn and detach it as follows:
def startDaemon
  @daemonPID = spawn("ruby c:/some_folder/daemon.rb", :new_pgroup => true, :err => :out)
  puts "DaemonPID #{@daemonPID}"
  daemonDetatch = Process.detach(@daemonPID)
  puts "Detached Daemon. Entering sleep...."
  sleep 15
  puts "Is daemon detached thread alive? => #{daemonDetatch.alive?}"
  puts "Attempt to kill daemon...."
  Process.kill("INT", @daemonPID)
  sleep 5
  puts "Is daemon detached thread still alive? => #{daemonDetatch.alive?}"
end
Ideally, the last puts statement should show daemonDetatch.alive? to be false. In reality, not only does daemonDetatch.alive? end up true, the daemon can also still be found running in both Task Manager and third-party apps such as Process Explorer.
The first question I have is about the spawn(...) function. The official documentation says that :new_pgroup "is necessary for Process.kill(:SIGINT, pid) on the subprocess" since it determines whether the subprocess becomes a new process group or not. I've toggled this parameter, but it didn't seem to make a difference.
Also, I am planning to give this solution a try, which involves using the win32-process gem. I am just wondering if there are other solutions out there.
[Edit]
I have validated the PID of the daemon obtained in the main script, in the daemon itself (with $$), and in Process Explorer, and they are all the same.
I have gotten suggestions from many others to just use "taskkill /f" to terminate the daemon. That will indeed end the daemon, but the daemon cannot trap "TERM" or "KILL" the way it traps "INT", meaning it will be unable to run its clean-up/quit routine.

Unicorn exit timeout on Heroku after trapping TERM and sending QUIT

I am receiving R12 Exit Timeout errors for a Heroku app running Unicorn and Sidekiq. These errors occur once or twice a day and whenever I deploy. I understand that I need to translate Heroku's shutdown signals into something Unicorn responds to correctly, but I thought I had done so with the unicorn config below:
worker_processes 3
timeout 30
preload_app true
before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts "Unicorn master intercepting TERM and sending myself QUIT instead. My PID is #{Process.pid}"
    Process.kill 'QUIT', Process.pid
  end

  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
    Rails.logger.info('Disconnected from ActiveRecord')
  end
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts "Unicorn worker intercepting TERM and doing nothing. Wait for master to sent QUIT. My PID is #{Process.pid}"
  end

  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.establish_connection
    Rails.logger.info('Connected to ActiveRecord')
  end

  Sidekiq.configure_client do |config|
    config.redis = { :size => 1 }
  end
end
My logs surrounding the error look like this:
Stopping all processes with SIGTERM
Unicorn worker intercepting TERM and doing nothing. Wait for master to sent QUIT. My PID is 7
Unicorn worker intercepting TERM and doing nothing. Wait for master to sent QUIT. My PID is 11
Unicorn worker intercepting TERM and doing nothing. Wait for master to sent QUIT. My PID is 15
Unicorn master intercepting TERM and sending myself QUIT instead. My PID is 2
Started GET "/manage"
reaped #<Process::Status: pid 11 exit 0> worker=1
reaped #<Process::Status: pid 7 exit 0> worker=0
reaped #<Process::Status: pid 15 exit 0> worker=2
master complete
Error R12 (Exit timeout) -> At least one process failed to exit within 10 seconds of SIGTERM
Stopping remaining processes with SIGKILL
Process exited with status 137
It appears that all of the child processes were successfully reaped before the timeout. Is it possible the master is still alive? Also, should the router still be sending web requests to the dyno during shutdown, as shown in the logs?
FWIW, I'm using Heroku's zero downtime deployment plugin (https://devcenter.heroku.com/articles/labs-preboot/).
I think your custom signal handling is what's causing the timeouts here.
EDIT: I'm getting downvoted for disagreeing with Heroku's documentation and I'd like to address this.
Configuring your Unicorn application to catch and swallow the TERM signal is the most likely cause of your application hanging and not shutting down correctly.
Heroku seems to argue that catching and transforming a TERM signal into a QUIT signal is the right behavior to turn a hard shutdown into a graceful shutdown.
However, doing this seems to introduce the risk of no shutdown at all in some cases - the root of this bug. Users experiencing hanging dynos running Unicorn should consider the evidence and make their own decision based on first principles, not just documentation.
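If you accept that argument, the alternative is simply to drop the traps and let Unicorn's default TERM handling (its quick shutdown) run inside Heroku's 10-second window. A minimal sketch, keeping only the asker's worker count, preload setting, and ActiveRecord hooks:
worker_processes 3
timeout 30
preload_app true

before_fork do |server, worker|
  # no TERM trap here: Heroku's SIGTERM reaches Unicorn directly and triggers its quick shutdown
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.establish_connection
end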

Unicorns enter a restart loop on Heroku

I have a Rails application deployed to Heroku's Celadon Cedar stack using Unicorn (4.5.0) with the following unicorn.rb file:
worker_processes 2 # amount of unicorn workers to spin up
timeout 30 # restarts workers that hang for 30 seconds
check_client_connection true
At seemingly random times, without any noticeable change in the services it uses (including the DB), the Unicorn workers will enter a restart loop. They keep restarting with the following typical error:
ERROR -- : worker=0 PID:935 timeout (31s > 30s), killing
The problem is that it keeps restarting more often than every 30 seconds per Unicorn worker. It stops when the underlying dyno gets restarted, so I'm guessing it has something to do with the way the Unicorn master process and Heroku interact.
Anyone else experiencing this or has any ideas as to what could be the cause?
You should not use the check_client_connection true option.
According to Heroku's Unicorn documentation, you should use a configuration file like this:
# config/unicorn.rb
worker_processes 3
timeout 15
preload_app true
before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end

  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end

  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.establish_connection
end

After Process.wait, $?.exited? is false

If I run the following:
Process.kill "KILL", pid
Process.wait pid
raise "application did not exit" if not $?.exited?
raise "application failed" if not $?.success?
I get the error "application did not exit". Why is Process.wait not waiting? More precisely, why does Process.wait set $? to a not-exited status?
The process exited immediately after receiving the kill signal, so Process.wait immediately assigns the appropriate Process::Status object to $? and allows execution to continue. Process::Status#exited? only returns true when the process exited normally, so what you are seeing is expected behavior -- SIGKILL generally does not cause normal termination.
From the Process::Status documentation:
Posix systems record information on processes using a 16-bit
integer. The lower bits record the process status (stopped,
exited, signaled) and the upper bits possibly contain additional
information.
A status of exited would mean the process exited by itself (the process called exit()). But since you forcibly killed the process it will have a status of signaled instead, meaning the process exited because of an uncaught signal (KILL in this case).
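A short sketch of how you might distinguish the two cases after Process.wait, using the standard Process::Status methods and adapting the snippet above:
Process.kill "KILL", pid
Process.wait pid
status = $?

if status.signaled?
  # terminated by an uncaught signal; termsig is the signal number (9 for SIGKILL)
  puts "killed by signal #{status.termsig}"
elsif status.exited?
  # exited on its own; exitstatus holds the value passed to exit()
  puts "exited normally with status #{status.exitstatus}"
end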

Use `reload` instead of `restart` for Unicorn?

I'm a little confused about my deploy strategy here: when deploying, under what circumstances would I want to send a reload signal to Unicorn? For example, in my case it would be something like:
sudo kill -s USR2 `cat /home/deploy/apps/my_app/current/tmp/pids/unicorn.pid`
I've been deploying my apps by killing that pid, then starting unicorn again via something like:
bundle exec unicorn -c config/unicorn/production.rb -E production -D
I'm just wondering why I'd want to use reload? Can I gain any performance for my deployment by doing so?
When you kill Unicorn, you cause downtime until Unicorn can start back up. When you send the USR2 signal, the running master re-executes itself and starts a new master with new workers alongside the old one; once the new process is up, the old master and its workers are told to quit. It's basically all about removing the need to "turn off" Unicorn.
Note: this assumes you have the documented before_fork hook in your Unicorn configuration, which handles shutting down the old master once the new one is up by checking for an ".oldbin" file containing the PID of the old Unicorn process:
before_fork do |server, worker|
  # a .oldbin file exists if Unicorn was gracefully restarted with a USR2 signal;
  # we should terminate the old process now that we're up and running
  old_pid = "#{pids_dir}/unicorn.pid.oldbin"
  if File.exists?(old_pid)
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end
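Note that pids_dir in that hook is not something Unicorn defines for you; it's a variable your config is expected to set, typically alongside the pid directive that tells Unicorn where to write its pid file (the .oldbin file appears next to it during a USR2 restart). A hypothetical example, with the path being an assumption about your deploy layout:
# hypothetical path; adjust to wherever your deploy keeps pid files
pids_dir = "/home/deploy/apps/my_app/shared/pids"
pid "#{pids_dir}/unicorn.pid"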
