Heroku Sidekiq Setup Still Getting ActiveRecord::ConnectionTimeoutError

I'm lost in the weeds. I'm trying to optimize my Sidekiq / Heroku setup with Puma, but I'm either getting a massive number of jobs backed up in the queue or a huge number failing and going into the retry queue.
I understand the concurrency optimization math to be:
Puma workers x RAILS_MAX_THREADS, which would be 10 in my case.
Or is it:
Puma workers x min/max Puma threads, which would be between 2 and 32 in my case?
My worker is running on 1 PM (Performance-M) dyno.
My puma config is this:
workers 2
threads 1, 16
My database.yml is this:
default: &default
  adapter: postgresql
  pool: 100
  timeout: 5000
And RAILS_MAX_THREADS is 5
I'm also using Redis with a connection limit of 256, but that doesn't seem to be a factor.
And I'm booting Sidekiq on Heroku with this:
worker: bundle exec sidekiq -c 15
When I was running Sidekiq with 5 threads the queue was backing up, and with 25 threads the retry queue was filling up.
Is 10 Sidekiq threads the best answer here?
EDIT:
I'm trying 10 now to see how it goes.
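For reference, a minimal database.yml sketch of the usual sizing approach: read the pool from an env var so each process gets at least one connection per thread (DB_POOL is an assumed config var here, set to the Sidekiq concurrency on the worker dyno and to the Puma thread max on web):
default: &default
  adapter: postgresql
  # one connection per thread in the process: 16 for Puma (threads 1,16), 10 for sidekiq -c 10
  pool: <%= ENV.fetch("DB_POOL") { ENV.fetch("RAILS_MAX_THREADS", 5) } %>
  timeout: 5000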

Related

Heroku workers crashing in Laravel app if number of workers > 1

I've been using Heroku to host my application for several years and just started running into issues with the worker queue getting backlogged. I was hoping I could fix this by increasing the number of workers running so queued jobs could be completed in parallel, but whenever I scale up my number of workers, all but one crash.
Here's my Procfile:
web: vendor/bin/heroku-php-apache2 public
worker: php /app/artisan queue:restart && php /app/artisan queue:work redis --tries=3 --timeout=30
Here's the output from my server logs when I scale up my workers to anything greater than 1 (in this example, I was just scaling to 2 workers):
Mar 16 06:04:51 heroku/worker.1 Starting process with command `php /app/artisan queue:restart && php /app/artisan queue:work redis --tries=3 --timeout=30`
Mar 16 06:04:52 heroku/worker.1 State changed from starting to up
Mar 16 06:04:54 app/worker.1 Broadcasting queue restart signal.
Mar 16 06:04:58 heroku/worker.2 Process exited with status 0
Mar 16 06:04:58 heroku/worker.2 State changed from up to crashed
Mar 16 06:04:58 heroku/worker.2 State changed from crashed to starting
Mar 16 06:05:09 heroku/worker.2 Starting process with command `php /app/artisan queue:restart && php /app/artisan queue:work redis --tries=3 --timeout=30`
Mar 16 06:05:10 heroku/worker.2 State changed from starting to up
Mar 16 06:05:14 app/worker.2 Broadcasting queue restart signal.
Mar 16 06:05:19 heroku/worker.1 Process exited with status 0
Mar 16 06:05:19 heroku/worker.1 State changed from up to crashed
As you can see, both workers try to start but only worker.2 stays in the up status.
The crashed workers try restarting every 10 minutes with the same result as above.
When I run heroku ps, here's what I see:
=== worker (Standard-1X): php /app/artisan queue:restart && php /app/artisan queue:work redis --tries=3 --timeout=30 (2)
worker.1: crashed 2021/03/16 06:05:19 -0600 (~ 20m ago)
worker.2: up 2021/03/16 06:05:10 -0600 (~ 20m ago)
(My normal web dynos scale up and down just fine, so I'm not showing those here.)
Any thoughts as to what could be happening? My first thought was that there was an issue going on with Heroku, but I realized that wasn't the case. My second thought is that my Procfile entry for my worker could be causing problems, but I don't know enough about that entry to know what could be the cause.
Again, this has been working fine for 1 worker for a long time and the crashing only happens when I try to scale up to more than 1 worker. Regardless of how many workers I try scaling to, only one doesn't crash and remains active and able to receive and process jobs.
Misc info:
Heroku stack: Heroku-18
Laravel version: 8.*
Queue driver: Redis
Update - I scaled up the dynos on my staging environment and was able to scale the workers up and down without any kind of crashes. Now I'm thinking there might be some kind of add-on conflict or something else going on. I'll update this if I find anything else out (already reached out to Heroku support).
The problem was the php /app/artisan queue:restart command in the Procfile. Each worker starting up and broadcasting the restart signal was sending conflicting signals to the other workers' queue:work processes, which eventually caused all but one of the workers to crash.
I took out that command and I can scale my workers without issue now.
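The worker entry in my Procfile is now just the queue:work command (the queue names are specific to my app):
worker: php /app/artisan queue:work redis --queue=high,default,sync,emails,cron --tries=3 --timeout=30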
=== worker (Standard-1X): php /app/artisan queue:work redis --queue=high,default,sync,emails,cron --tries=3 --timeout=30 (2)
worker.1: up 2021/03/17 17:29:32 -0600 (~ 8m ago)
worker.2: up 2021/03/17 17:35:58 -0600 (~ 2m ago)
When a deployment is made to Heroku, the dynos receive a SIGTERM signal which kills any lingering processes and then the dynos are restarted. This means the php /app/artisan queue:restart command was redundant and unnecessary.
The main confusion came from the way Laravel words the guidance about queue workers needing a restart after deployment here: https://laravel.com/docs/8.x/queues#queue-workers-and-deployment. That restart is necessary on servers where worker processes aren't cycled on deploy the way Heroku handles dynos.

How to restart Sidekiq when running on Heroku?

I am running sidekiq in a worker on Heroku as follows:
bundle exec sidekiq -t 25 -e $RAILS_ENV -c 3
One of the operations uses more memory (>500MB) than the worker allows. After the job has completed the memory still hasn't been released, and I get these errors in the Heroku logs:
2018-11-13T00:56:05.642142+00:00 heroku[sidekiq_worker.1]: Process running mem=646M(126.4%)
2018-11-13T00:56:05.642650+00:00 heroku[sidekiq_worker.1]: Error R14 (Memory quota exceeded)
Is there a way to automatically restart Sidekiq when the memory usage exceeds a certain amount?
Thanks!
Have you tried reducing memory fragmentation? Here's how you can do it on Heroku.
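Guessing at the specifics of your setup, the usual first tweak for a Ruby app on Heroku is to cap glibc's malloc arenas on the worker dyno:
heroku config:set MALLOC_ARENA_MAX=2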
If that isn't good enough, you can use the Heroku platform-api gem and periodically restart the Sidekiq dyno.
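A minimal sketch of that approach, assuming the platform-api gem, HEROKU_API_TOKEN and HEROKU_APP_NAME config vars, and the sidekiq_worker dyno name from the logs above (run it from a scheduled job, for example):
require "platform-api"

MAX_RSS_MB = 450  # restart threshold, below the point where R14 errors start

# Resident memory of this process, in MB.
rss_mb = `ps -o rss= -p #{Process.pid}`.to_i / 1024

if rss_mb > MAX_RSS_MB
  heroku = PlatformAPI.connect_oauth(ENV["HEROKU_API_TOKEN"])
  heroku.dyno.restart(ENV["HEROKU_APP_NAME"], "sidekiq_worker.1")
end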

Celery can't connect to broker but executes scheduled tasks

We are using Celery 4.1.0 with Django 1.33.1 on Heroku. There are two dynos configured in the Procfile, one for web, one for both Celery worker and beat:
web: gunicorn dms.wsgi --log-file=-
worker: celery -A dms.tasks worker -B --scheduler django_celery_beat.schedulers:DatabaseScheduler --without-gossip --without-mingle --without-heartbeat
There are several scheduled tasks configured which execute just fine. However, when trying to send tasks to the broker it says ConnectionRefusedError: [Errno 111] Connection refused, seemingly regardless of the type of broker. (Actually the process just times out and only shows the error when sending a task manually, but we suspect that is related to a different bug.)
I tried RabbitMQ and Redis and got the same error message on both. It is a staging environment with a single user and I double-checked we didn't hit any queue/connection limits.
dms/settings.py:
CELERY_BROKER = os.environ.get('CELERY_BROKER')
CELERY_TASK_IGNORE_RESULT = True
BROKER_POOL_LIMIT = 1
CELERY_IMPORTS = (
    'dms.tasks',
    'core.tasks',
)
dms/tasks.py:
import os
from celery import Celery
from django.conf import settings

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'dms.settings')
app = Celery('dms', broker=settings.CELERY_BROKER)
app.config_from_object('django.conf:settings')
app.autodiscover_tasks()
We also contacted Heroku and CloudAMQP support.

Concurrency with Resque on Heroku (running multiple workers per node)

Pardon my ignorance, but is there a way to increase the number of processes per dyno for Resque workers? And if so, how?
I'm currently using Unicorn to add concurrency to the web dynos, which has been working great so far. I would like to extend this to the Resque workers. I followed Heroku's guide to set up the concurrency.
Update: The solution below works, but is not recommended. For Resque concurrency on Heroku, use the resque-pool gem.
It is possible if you use the COUNT=n option. Your Procfile will look something like:
web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
resque: env TERM_CHILD=1 COUNT=2 RESQUE_TERM_TIMEOUT=6 QUEUE=* bundle exec rake resque:workers
It is important to note that the rake task in the Procfile is resque:workers and not resque:work.
Update Explanation
There are major problems with the COUNT=n option and the rake resque:workers invocation in production on Heroku. Because of the way Resque starts up the multiple workers using threads, none of the SIGTERM, SIGKILL, etc. handling that allows workers to stop the current job, re-enqueue it, and shut down properly (including de-registering) will ever happen. This is because the signals are handled by the main process and not trapped by the threads. This can cause phantom workers to remain in the worker list long after they've been killed. This is probably why there is a comment in the Resque code warning that resque:workers should only be used in development mode.
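For reference, the resque-pool setup is roughly a Procfile worker entry plus a config/resque-pool.yml mapping queues to worker counts (the counts below are only an example):
Procfile:
web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
worker: bundle exec resque-pool
config/resque-pool.yml:
"*": 2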

How can I tell how many worker dynos I'm using on Heroku?

I'm using HireFireApp to autoscale my web and worker dynos on Heroku. However, when I navigate to the Resque app on my application it says
"0 of 46 Workers Working"
Does this mean that I'm using 46 worker dynos?
Update:
Running heroku ps shows:
web.1 up for 21m bundle exec thin start -p $PORT
worker.1 starting for 1s bundle exec rake resque:work QUEUE..
From the command line in your Heroku app, have a look at the output of
heroku ps
That will show you how many worker dynos you are running.
