Unicorns enter a restart loop on Heroku

I have a Rails application deployed to Heroku's Celadon Cedar stack, using Unicorn (4.5.0) with the following unicorn.rb file:
worker_processes 2 # amount of unicorn workers to spin up
timeout 30 # restarts workers that hang for 30 seconds
check_client_connection true
At seemingly random times, without any noticeable change in the services it uses (including the DB), the unicorns will enter a restart loop. They keep restarting with the following typical error:
ERROR -- : worker=0 PID:935 timeout (31s > 30s), killing
The problem is that it keeps restarting more often than every 30 seconds per unicorn worker. It stops when the underlying dyno gets restarted, so I'm guessing it has something to do with the way the unicorn master process and Heroku interact.
Is anyone else experiencing this, or does anyone have an idea what the cause could be?

You should not use the check_client_connection true option; it is meant for listeners that clients connect to directly, and behind Heroku's routing layer it won't behave the way you expect.
According to Heroku's Unicorn documentation, you should use a configuration file like this:
# config/unicorn.rb
worker_processes 3
timeout 15
preload_app true

before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end

  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end

  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.establish_connection
end
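Separately, since the restart loop is driven by workers hitting the kill threshold, it usually helps to abort a slow request before the master has to kill the whole worker. If memory serves, Heroku's Unicorn guide suggested the rack-timeout gem for this; a sketch using its old class-level API (newer versions are configured differently, and the initializer file name here is arbitrary):

# Gemfile
gem 'rack-timeout'

# config/initializers/rack_timeout.rb (file name is arbitrary)
# Abort a single slow request after 10s, below unicorn's process-level
# `timeout 15`, so the master doesn't have to kill the whole worker.
Rack::Timeout.timeout = 10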

Related

[Ruby 1.9][Windows] Sending Ctrl-C interrupt signal to a spawned subprocess

I have a main script in Ruby 1.9.3 running on Windows. It will start another Ruby script that runs as a daemon, do its own stuff, then end the daemon by sending an "INT" signal. The main script and daemon don't otherwise exchange any data.
The daemon itself can run as a standalone, and we terminate it with Ctrl-C. Here's the part that prepares it for the signal:
def setup_ctrl_c_to_quit
  Thread.new do
    trap("INT") do
      puts "got INT signal"
      exit
    end
    while true
      sleep 1
    end
  end
end
I am currently having trouble getting the main script to launch and then terminate the daemon. Currently, I can start the daemon through spawn and detach it as such:
def startDaemon
  @daemonPID = spawn("ruby c:/some_folder/daemon.rb", :new_pgroup => true, :err => :out)
  puts "DaemonPID #{@daemonPID}"
  daemonDetatch = Process.detach(@daemonPID)
  puts "Detached Daemon. Entering sleep...."
  sleep 15
  puts "Is daemon detached thread alive? => #{daemonDetatch.alive?}"
  puts "Attempt to kill daemon...."
  Process.kill("INT", @daemonPID)
  sleep 5
  puts "Is daemon detached thread still alive? => #{daemonDetatch.alive?}"
end
Ideally, the last puts statement should show daemonDetatch.alive? to be false. In reality, not only does daemonDetatch.alive? end up being true, the daemon can also still be found running in both the Task Manager and third-party apps such as Process Explorer.
The first question I have is about the spawn(...) function. The official documentation says that :new_pgroup "is necessary for Process.kill(:SIGINT, pid) on the subprocess", since it determines whether the subprocess becomes a new process group or not. I've toggled this parameter, but it didn't seem to make a difference.
Also, I am planning to give this solution a try, which involves using the win32-process gem. I am just wondering if there are other solutions out there.
[Edit]
I have validated the PID of the daemon obtained in the main script, in the daemon itself (with $$), and in Process Explorer, and they are all the same.
I have gotten suggestions from many others to just use "taskkill /f" to terminate the daemon. That does indeed end the daemon, but the daemon cannot trap the "TERM" or "KILL" signals the way it traps "INT", meaning it is unable to run its clean-up/quit routine.

Starting or restarting Unicorn with Capistrano 3.x

I'm trying to start or restart Unicorn when I do cap production deploy with Capistrano 3.0.1. I have some examples that I got working with Capistrano 2.x using something like:
namespace :unicorn do
  desc "Start unicorn for this application"
  task :start do
    run "cd #{current_path} && bundle exec unicorn -c /etc/unicorn/myapp.conf.rb -D"
  end
end
But when I try to use run in deploy.rb with Capistrano 3.x, I get an undefined method error.
Here are a couple of the things I tried:
# within the :deploy namespace I created a task that I call after :finished
namespace :deploy do
  ...
  task :unicorn do
    run "cd #{current_path} && bundle exec unicorn -c /etc/unicorn/myapp.conf.rb -D"
  end
  after :finished, 'deploy:unicorn'
end
I have also tried putting the run within the :restart task:
namespace :deploy do
  desc 'Restart application'
  task :restart do
    on roles(:app), in: :sequence, wait: 5 do
      # Your restart mechanism here, for example:
      # execute :touch, release_path.join('tmp/restart.txt')
      execute :run, "cd #{current_path} && bundle exec unicorn -c /etc/unicorn/deployrails.conf.rb -D"
    end
  end
end
If I use just run "cd ..." then I'll get a "wrong number of arguments (1 for 0)" error in the local shell.
I can start the unicorn process with unicorn -c /etc/unicorn/deployrails.conf.rb -D from my ssh'd VM shell.
I can kill the master Unicorn process from the VM shell using kill USR2, but even though the process is killed, I get an error. I can then start the process again using unicorn -c ...
$ kill USR2 58798
bash: kill: USR2: arguments must be process or job IDs
I'm very new to Ruby, Rails, and deployment in general. I have a VirtualBox setup with Ubuntu, Nginx, RVM and Unicorn. I'm pretty excited so far, but this one is really messing with me; any advice or insight is appreciated.
I'm using the following code:
namespace :unicorn do
  desc 'Stop Unicorn'
  task :stop do
    on roles(:app) do
      if test("[ -f #{fetch(:unicorn_pid)} ]")
        execute :kill, capture(:cat, fetch(:unicorn_pid))
      end
    end
  end

  desc 'Start Unicorn'
  task :start do
    on roles(:app) do
      within current_path do
        with rails_env: fetch(:rails_env) do
          execute :bundle, "exec unicorn -c #{fetch(:unicorn_config)} -D"
        end
      end
    end
  end

  desc 'Reload Unicorn without killing master process'
  task :reload do
    on roles(:app) do
      if test("[ -f #{fetch(:unicorn_pid)} ]")
        execute :kill, '-s USR2', capture(:cat, fetch(:unicorn_pid))
      else
        error 'Unicorn process not running'
      end
    end
  end

  desc 'Restart Unicorn'
  task :restart
  before :restart, :stop
  before :restart, :start
end
I can't say anything specific about Capistrano 3 (I use 2), but I think this may help: How to run shell commands on server in Capistrano v3?
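In short, Capistrano 3 replaced run with execute, which must be called inside an on roles(...) block. A minimal sketch of the asker's start task in that style (same paths as in the question):

namespace :unicorn do
  desc 'Start unicorn for this application'
  task :start do
    on roles(:app) do
      within current_path do
        # Capistrano 3 has no `run`; `execute` inside `on roles(...)` replaces it
        execute :bundle, :exec, :unicorn, '-c', '/etc/unicorn/myapp.conf.rb', '-D'
      end
    end
  end
end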
Also, I can share some Unicorn-related experience; hope this helps.
I assume you want a 24/7 graceful-restart approach.
Let's consult the Unicorn documentation on this matter. For a graceful restart (without downtime) you can use two strategies:
kill -HUP unicorn_master_pid. This requires your app to have the preload_app directive disabled, which increases the start time of every Unicorn worker. If you can live with that, go on; it's your call.
kill -USR2 unicorn_master_pid
kill -QUIT unicorn_master_pid
This is the more sophisticated approach, for when you're already dealing with performance concerns. Basically, USR2 re-executes the unicorn master process, after which you should kill its predecessor. In theory you can get by with a usr2-sleep-quit approach. Another (and, I'd say, the right) way is to use Unicorn's before_fork hook, which is executed when the new master process has spawned and is about to fork new children for itself.
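For illustration, the usr2-sleep-quit variant could be scripted as a Capistrano task like this (a sketch only; it assumes :unicorn_pid is set to the master's pidfile path, as in the previous answer, and the 10-second sleep is an arbitrary grace period for the new master to boot):

desc 'Restart Unicorn: USR2, wait, then QUIT the old master'
task :usr2_restart do
  on roles(:app) do
    old_pid = capture(:cat, fetch(:unicorn_pid))
    execute :kill, '-s USR2', old_pid  # re-execute a new master
    sleep 10                           # crude: give the new master time to boot
    execute :kill, '-s QUIT', old_pid  # gracefully stop the old master
  end
end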
For the before_fork route, you can put something like this in config/unicorn.rb:
# Where to drop a pidfile
pid project_home + '/tmp/pids/unicorn.pid'

before_fork do |server, worker|
  server.logger.info("worker=#{worker.nr} spawning in #{Dir.pwd}")

  # graceful shutdown
  old_pid_file = project_home + '/tmp/pids/unicorn.pid.oldbin'
  if File.exists?(old_pid_file) && server.pid != old_pid_file
    begin
      old_pid = File.read(old_pid_file).to_i
      server.logger.info("sending QUIT to #{old_pid}")
      # we're killing the old unicorn master right there
      Process.kill("QUIT", old_pid)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end
It's more or less safe to kill the old unicorn when the new one is ready to fork workers. You won't get any downtime that way, and the old unicorn will wait for its workers to finish.
And one more thing: you may want to put it under runit or init supervision. That way your Capistrano tasks will be as simple as sv reload unicorn, restart unicorn or /etc/init.d/unicorn restart. This is a good thing.
I'm just going to throw this in the ring: capistrano 3 unicorn gem
However, my issue with the gem (and any approach NOT using an init.d script), is that you may now have two methods of managing your unicorn process. One with this cap task and one with init.d scripts. Things like Monit / God will get confused and you may spend hours debugging why you have two unicorn processes trying to start, and then you may start to hate life.
Currently I'm using the following with capistrano 3 and unicorn:
namespace :unicorn do
  desc 'Restart application'
  task :restart do
    on roles(:app) do
      puts "restarting unicorn..."
      execute "sudo /etc/init.d/unicorn_#{fetch(:application)} restart"
      sleep 5
      puts "whats running now, eh unicorn?"
      execute "ps aux | grep unicorn"
    end
  end
end
The above is combined with preload_app true and the before_fork and after_fork statements mentioned by @dredozubov.
Note I've named my init.d/unicorn script unicorn_application_name.
The new worker that is started should kill off the old one. You can see with ps aux | grep unicorn that the old master hangs around for a few seconds before it disappears.
To view all cap tasks:
cap -T
and it shows:
cap unicorn:add_worker # Add a worker (TTIN)
cap unicorn:duplicate # Duplicate Unicorn; alias of unicorn:re...
cap unicorn:legacy_restart # Legacy Restart (USR2 + QUIT); use this...
cap unicorn:reload # Reload Unicorn (HUP); use this when pr...
cap unicorn:remove_worker # Remove a worker (TTOU)
cap unicorn:restart # Restart Unicorn (USR2); use this when ...
cap unicorn:start # Start Unicorn
cap unicorn:stop # Stop Unicorn (QUIT)
So, to start unicorn in production:
cap production unicorn:start
and restart:
cap production unicorn:restart
PS: don't forget to use the capistrano3-unicorn gem correctly:
https://github.com/tablexi/capistrano3-unicorn
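For reference, wiring the gem up looks roughly like this (a sketch from memory of the gem's README; the variable names are my recollection, so verify them against the link above):

# Capfile
require 'capistrano3/unicorn'

# config/deploy.rb (example values, adjust to your layout)
set :unicorn_pid, -> { shared_path.join('tmp/pids/unicorn.pid') }
set :unicorn_config_path, -> { shared_path.join('config/unicorn.rb') }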
You can try to use the native Capistrano way, as described in the gem's README:
If preload_app true is set and you need Capistrano to clean up your oldbin pid file, use:
after 'deploy:publishing', 'deploy:restart'

namespace :deploy do
  task :restart do
    invoke 'unicorn:legacy_restart'
  end
end

Unicorn exit timeout on Heroku after trapping TERM and sending QUIT

I am receiving R12 Exit Timeout errors for a Heroku app running Unicorn and Sidekiq. These errors occur 1-2 times a day and whenever I deploy. I understand that I need to translate the shutdown signals Heroku sends so that Unicorn responds correctly, but I thought I had done so in the unicorn config below:
worker_processes 3
timeout 30
preload_app true

before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts "Unicorn master intercepting TERM and sending myself QUIT instead. My PID is #{Process.pid}"
    Process.kill 'QUIT', Process.pid
  end

  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
    Rails.logger.info('Disconnected from ActiveRecord')
  end
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts "Unicorn worker intercepting TERM and doing nothing. Wait for master to sent QUIT. My PID is #{Process.pid}"
  end

  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.establish_connection
    Rails.logger.info('Connected to ActiveRecord')
  end

  Sidekiq.configure_client do |config|
    config.redis = { :size => 1 }
  end
end
My logs surrounding the error look like this:
Stopping all processes with SIGTERM
Unicorn worker intercepting TERM and doing nothing. Wait for master to sent QUIT. My PID is 7
Unicorn worker intercepting TERM and doing nothing. Wait for master to sent QUIT. My PID is 11
Unicorn worker intercepting TERM and doing nothing. Wait for master to sent QUIT. My PID is 15
Unicorn master intercepting TERM and sending myself QUIT instead. My PID is 2
Started GET "/manage"
reaped #<Process::Status: pid 11 exit 0> worker=1
reaped #<Process::Status: pid 7 exit 0> worker=0
reaped #<Process::Status: pid 15 exit 0> worker=2
master complete
Error R12 (Exit timeout) -> At least one process failed to exit within 10 seconds of SIGTERM
Stopping remaining processes with SIGKILL
Process exited with status 137
It appears that all of the child processes were successfully reaped before the timeout. Is it possible the master is still alive? Also, should the router still be sending web requests to the dyno during shutdown, as shown in the logs?
FWIW, I'm using Heroku's zero downtime deployment plugin (https://devcenter.heroku.com/articles/labs-preboot/).
I think your custom signal handling is what's causing the timeouts here.
EDIT: I'm getting downvoted for disagreeing with Heroku's documentation and I'd like to address this.
Configuring your Unicorn application to catch and swallow the TERM signal is the most likely cause of your application hanging and not shutting down correctly.
Heroku seems to argue that catching and transforming a TERM signal into a QUIT signal is the right behavior to turn a hard shutdown into a graceful shutdown.
However, doing this seems to introduce the risk of no shutdown at all in some cases - the root of this bug. Users experiencing hanging dynos running Unicorn should consider the evidence and make their own decision based on first principles, not just documentation.
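For comparison, the configuration this answer implies is simply the documented file with the traps removed, leaving Unicorn's default TERM handling (quick shutdown) in place; a minimal sketch:

# config/unicorn.rb -- sketch with no custom signal handling
worker_processes 3
timeout 30
preload_app true

before_fork do |server, worker|
  # DB housekeeping only; no Signal.trap, so TERM keeps Unicorn's default meaning
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.establish_connection
end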

How can I tell unicorn to understand Heroku's signals?

Perhaps you've seen this...
2012-03-07T15:36:25+00:00 heroku[web.1]: Stopping process with SIGTERM
2012-03-07T15:36:36+00:00 heroku[web.1]: Stopping process with SIGKILL
2012-03-07T15:36:36+00:00 heroku[web.1]: Error R12 (Exit timeout) -> Process failed to exit within 10 seconds of SIGTERM
2012-03-07T15:36:38+00:00 heroku[web.1]: Process exited with status 137
This is a well-known problem when running unicorn on heroku...
heroku uses SIGTERM for graceful shutdown
unicorn uses SIGTERM for quick shutdown
Can I tell heroku to send SIGQUIT? Or can I tell unicorn to treat SIGTERM as graceful shutdown?
Heroku now provides instruction for this here:
https://blog.heroku.com/archives/2013/2/27/unicorn_rails
Their suggested unicorn.rb file is:
# config/unicorn.rb
worker_processes 3
timeout 30
preload_app true

before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end

  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end

  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.establish_connection
end
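For completeness, the same article pairs this config with a Procfile entry along these lines (quoted from memory; verify against the linked post):

web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb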
This is a hack, but I've successfully created a unicorn config file that traps the TERM signal, preventing unicorn from receiving it and performing its quick shutdown. My signal handler then sends a QUIT signal back to itself to trigger unicorn's graceful shutdown.
Tested with Ruby 1.9.2, Unicorn 4.0.1 and 4.2.1, Mac OS X.
listen 9292
worker_processes 1

# This is a hack. The code is run with 'before_fork' so it runs
# *after* Unicorn installs its own TERM signal handler (which makes
# this highly dependent on the Unicorn implementation details).
#
# We install our own signal handler for TERM and simply re-send a QUIT
# signal to our self.
before_fork do |_server, _worker|
  Signal.trap 'TERM' do
    puts 'intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end
end
One concern is that (I believe) this signal handler is inherited by worker processes. But the worker process installs its own TERM handler, which should overwrite this one, so I would not expect any issue. (See Unicorn::HttpServer#init_worker_process in lib/unicorn/http_server.rb:551.)
Edit: one more detail: this block that installs the signal handler will run once per worker that is forked (because it is a before_fork hook), but this is merely redundant and won't affect anything.
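To sanity-check the hack locally, you can boot Unicorn under this config and send it TERM; a hypothetical sketch (it assumes the config above is saved as hack.conf.rb and a config.ru exists in the current directory):

# check_term.rb -- hypothetical local test for the TERM-to-QUIT hack
pid = spawn('bundle exec unicorn -c hack.conf.rb')
sleep 5                    # give the master time to boot and run before_fork
Process.kill('TERM', pid)  # should be intercepted and converted to QUIT
Process.wait(pid)
puts "unicorn exited with status #{$?.exitstatus}"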

Use `reload` instead of `restart` for Unicorn?

I'm a little confused about my deploy strategy here. When deploying, under what circumstances would I want to send a reload signal to unicorn? For example, in my case it would be like:
sudo kill -s USR2 `cat /home/deploy/apps/my_app/current/tmp/pids/unicorn.pid`
I've been deploying my apps by killing that pid, then starting unicorn again via something like:
bundle exec unicorn -c config/unicorn/production.rb -E production -D
I'm just wondering why I'd want to use reload. Can I gain any performance for my deployment by doing so?
When you kill unicorn you cause downtime until unicorn can start back up. When you use the USR2 signal, unicorn re-executes itself and starts new workers first; then, once they are running, the old master and its workers are killed. It's basically all about removing the need to "turn off" unicorn.
Note, the assumption is that you have the documented before_fork hook in your unicorn configuration to handle killing the old master, should an ".oldbin" file be found containing the PID of the old unicorn process:
before_fork do |server, worker|
  # a .oldbin file exists if unicorn was gracefully restarted with a USR2 signal;
  # we should terminate the old process now that we're up and running
  old_pid = "#{pids_dir}/unicorn.pid.oldbin"
  if File.exists?(old_pid)
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end
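Putting it together, a reload-style deploy then just signals the running master and lets the hook above reap the old one. As a hypothetical sketch (the pidfile path is the one from the question):

# deploy_reload.rb -- hypothetical helper
pid_file = '/home/deploy/apps/my_app/current/tmp/pids/unicorn.pid'
old_pid  = File.read(pid_file).to_i

# USR2 tells the running master to re-execute itself with the new code;
# the before_fork hook above then QUITs the old master once the new one
# is ready to fork workers.
Process.kill('USR2', old_pid)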
