My app runs on Heroku with unicorn and uses sucker_punch to send a small quantity of emails in the background without slowing the web UI. This has been working pretty well for a few weeks.
I changed the unicorn config to the Heroku recommended config. The recommended config includes an option for the number of unicorn worker processes, and I upped that number from 2 to 3.
Apparently that was too much. The sucker_punch jobs stopped running. I have log messages that indicate when they are queued and I have messages that indicate when they start processing. The log shows them being queued but the processing never starts.
My theory is that I exceeded memory by going from 2 to 3 unicorns.
I did not find a message anywhere indicating a problem.
Q1: Should I expect to find a failure message somewhere? Something like "attempting to start sucker_punch -- oops, not enough memory"?
Q2: Any suggestions on how I can be notified of a failure like this in the future?
Thanks.
If you are indeed exceeding dyno memory, you should find R14 or R15 errors in your logs. See https://devcenter.heroku.com/articles/error-codes#r14-memory-quota-exceeded
A more likely problem, though, given that you haven't found these errors, is that something within the perform method of your sucker punch worker is throwing an exception. I've found sucker punch tasks to be a pain to debug because it appears the lib swallows all exceptions silently. Try instantiating your task and calling perform on it from a rails console to make sure that it behaves as you expect.
For example, you should be able to do this without causing an exception:
# instantiate the job directly and call perform synchronously,
# passing the same arguments you would normally enqueue
task = YourTask.new
task.perform :something, 55
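For Q2 (getting notified): if I remember the API correctly, sucker_punch 2.x lets you register a global exception handler, so failures inside perform are at least logged or reported instead of disappearing. A minimal sketch, assuming you just want them in the Rails log (swap in your notifier of choice):

# config/initializers/sucker_punch.rb
# Called whenever a job raises; ex is the exception, klass the job class, args the arguments.
SuckerPunch.exception_handler = ->(ex, klass, args) do
  Rails.logger.error("[sucker_punch] #{klass} failed with args=#{args.inspect}: #{ex.class} #{ex.message}")
end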
Related
I'm familiar with Heroku's policy on dyno restarts (once every 24 hours, if underlying hardware fails, etc.).
I've also looked elsewhere on Stack Overflow and found questions like
Controlling Heroku's random dyno restarts.
Our apps are good from a web perspective – sessions are handled via external database and multiple dynos balance the load perfectly. Restarts aren't an issue there. The issue is workers. For example, our worker receives a message and begins processing a job. It's 99% done, awaiting some final asynchronous request to return, and it receives SIGTERM. Before it can even clean up the job, the process is killed. The code can handle local cleanup for a job that needs to be restarted, but external services can't really be part of the transaction.
For example, if a report was built and an email was sent, but some 3rd async operation didn't complete before SIGTERM, I can't really roll back that transaction. For hardware failures or other rare events, it's understandable that a multi-step transaction could get truncated, but with Heroku's policy it seems that I need to assume this will happen at least once a day. Can anyone help me get a better grip on this problem?
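To make the question concrete, the worker loop I'd like to make restart-safe looks roughly like this (the queue and job methods here are simplified stand-ins, not our real code). Heroku documents a grace period of about 30 seconds between SIGTERM and SIGKILL, so the usual advice is to stop taking new work and checkpoint the current job:

shutting_down = false

# Only set a flag in the handler; do the actual cleanup in the main loop.
Signal.trap("TERM") { shutting_down = true }

until shutting_down
  job = JobQueue.pop(timeout: 1)   # hypothetical queue API
  next if job.nil?
  job.mark_in_progress!            # persist enough state to resume or re-run safely
  job.perform                      # this is where SIGTERM can land mid-flight
  job.mark_done!
end

The hard part, as described above, is that external side effects (emails sent, third-party calls) can't be rolled back, so each step has to be idempotent or recorded before it runs.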
I have a Rails app which handles activation of a license using an external service. The external service sometimes delays the handling of the Rails request to over 30s, which then returns an error to the front end (I'm running on Heroku, so the max is 30s).
I tried using ActiveJob with the default Rails async adapter (Rails 5), and I can see that it works on Heroku out of the box. I keep reading that I should be using a separate worker process and, for example, Redis, but if the background job just needs to be performed straight after the request is done, and it's only hitting an external API which may be slow, is it so bad to use the default async adapter?
I can see that this is handled in an in-process thread, but I don't see a reason why such a small job should need a separate process.
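For reference, the setup is roughly this (the job and service names are just illustrative, not the real code):

# config/application.rb - the async adapter is the Rails 5 default,
# but it can be set explicitly:
config.active_job.queue_adapter = :async

# app/jobs/license_activation_job.rb
class LicenseActivationJob < ApplicationJob
  queue_as :default

  def perform(license_id)
    license = License.find(license_id)
    ExternalLicenseService.activate(license)   # the slow third-party call
  end
end

# In the controller: respond right away and let the job run in a background thread
LicenseActivationJob.perform_later(license.id)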
I use the async adapter in production for sending emails. This is a very small job. An email could take up to 3 seconds to send.
The docs say it's a poor fit for production because it will drop pending jobs on restart. If I remember correctly, Heroku restarts dynos once a day.
If your job is pending during the restart, the job will be lost. In my case, the chance of an email being pending during a restart is pretty slim. So far so good.
But if you have jobs taking 30 seconds, I'd use Resque or DelayedJob.
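For what it's worth, switching later is mostly configuration; the job code stays the same. A sketch using DelayedJob as one example backend (mailer name is illustrative):

# Gemfile
gem "delayed_job_active_record"
# (delayed_job also needs its jobs table migration and a worker process; see its README)

# config/application.rb
config.active_job.queue_adapter = :delayed_job

# Enqueueing doesn't change, e.g. for mail:
UserMailer.welcome(user).deliver_later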
For a small background job in production, one which does not require 100% persistence in case of failure/server restart and whose duration is short enough that a separate worker process would be overkill, I'd recommend using Sucker Punch.
The Sucker Punch gem is designed to handle exactly this case. It prepares an execution thread pool for each job class you create, using the concurrent-ruby gem, which is (probably) the most robust concurrency library in Ruby. It also hooks into process exit to finish all pending tasks, so I'd expect it to be a bit more reliable than the async adapter.
One thing to note is that although Sucker Punch is supported as an Active Job backend, the adapter is not well written; or at least, when you use the Sucker Punch adapter, its behavior is much like that of the async adapter. So I'd recommend using bare Sucker Punch if you want something just a little more useful/robust than the async adapter.
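For comparison, a bare Sucker Punch job looks roughly like this (class and service names are just placeholders):

# app/jobs/license_activation_job.rb
class LicenseActivationJob
  include SuckerPunch::Job

  def perform(license_id)
    license = License.find(license_id)
    ExternalLicenseService.activate(license)
  end
end

LicenseActivationJob.perform_async(license.id)    # run as soon as a pool thread is free
LicenseActivationJob.perform_in(60, license.id)   # or roughly 60 seconds from now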
Hi, we have created a syslog drain from PCF to Logstash, but sometimes we are getting this error:
2018-07-19T15:09:53.524+05:30 [LGR/] [ERR] Syslog Drain: Error when writing. Backing off for 4ms.
What is this and why does it happen?
I suspect that it's a communication problem with the logging system in your Cloud Foundry platform as it's trying to talk with your LogStash. The message doesn't give you an exact error though. To find that, you would need to be a platform operator and look at the Loggregator logs to see why it's failing. If you're not the CF platform operator, reach out to your operator for assistance.
When you see errors like this I would suggest checking for two things:
How often do you see this message?
How large does the number in "Backing off for XXms." get?
When an error occurs while sending logs, the platform backs off, and as errors continue to occur the backoff timeout gets larger. If you see a large value in the backoff timeout, you have a prolonged problem: perhaps the log drain is configured incorrectly, your LogStash server is down, or the network to it is down. If you see the errors frequently but the number stays low, sending is only failing intermittently (some logs go through, some don't), which could point to a flaky network connection, one that's up and down a lot.
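A quick way to answer those two questions from a saved copy of the logs is to grep for the message and track the backoff values. A rough Ruby sketch (the log file path is just an example):

# scan_backoff.rb - count the drain errors per hour and find the largest backoff
counts = Hash.new(0)
max_backoff = 0

File.foreach("cf-app.log") do |line|
  next unless line =~ /Syslog Drain: Error when writing\. Backing off for (\d+)ms/
  ms = Regexp.last_match(1).to_i
  max_backoff = ms if ms > max_backoff
  counts[line[0, 13]] += 1    # bucket by timestamp prefix, e.g. "2018-07-19T15"
end

counts.sort.each { |hour, n| puts "#{hour}: #{n} errors" }
puts "largest backoff seen: #{max_backoff}ms"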
Is it possible to somehow view Sidekiq's list of completed jobs - for example, to find all PurchaseWorkers with params (1)? Yesterday in my app a delayed method that was supposed to run didn't, and the associated entity (let's say a 'purchase') got stuck in limbo in the "processing" state. I'm trying to understand the reason: either the job wasn't enqueued at all, or it was enqueued but exited unexpectedly for some reason. There were no errors in the Sidekiq log.
Thanks.
This is old but I wanted to see the same thing since I'm not sure if jobs I scheduled ran or not!
It turns out Sidekiq didn't have anything built in to see jobs that have completed, and it still doesn't seem to.
If a job errored and never completed, it should end up in the 'dead' set once its retries are exhausted. But checking that something actually ran seems to be beyond Sidekiq by default.
The FAQ suggests installing 3rd party plugins to track and log information: https://github.com/mperham/sidekiq/wiki/FAQ#how-can-i-tell-when-a-job-has-finished
One of them allows for a callback to do follow-up (maybe add a record for completed jobs elsewhere?)
You can also set up Sidekiq to log somewhere other than STDOUT (the default) so you can output log information about your jobs: in this case, log that a job completed, or catch errors if for some reason a failing job never lands in the retry or dead sets. See https://github.com/mperham/sidekiq/wiki/Logging
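A rough sketch of both ideas (the exact logger configuration API has shifted between Sidekiq versions, so double-check the Logging wiki page above; the worker mirrors the PurchaseWorker from the question):

# config/initializers/sidekiq.rb - log to a file instead of STDOUT
Sidekiq.configure_server do |config|
  config.logger = Logger.new(Rails.root.join("log", "sidekiq.log"))
end

# app/workers/purchase_worker.rb - leave breadcrumbs so you can tell a job ran
class PurchaseWorker
  include Sidekiq::Worker

  def perform(purchase_id)
    Sidekiq.logger.info "PurchaseWorker started purchase_id=#{purchase_id}"
    # ... actual work ...
    Sidekiq.logger.info "PurchaseWorker finished purchase_id=#{purchase_id}"
  end
end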
To see jobs still in a queue, you can use the Rails console and look up the queue by name: https://www.rubydoc.info/gems/sidekiq/Sidekiq/Queue
One option is the built-in stats provided by Sidekiq - https://github.com/mperham/sidekiq/wiki/Monitoring#using-the-built-in-dashboard
The best option is to use the Web UI provided here - https://github.com/mperham/sidekiq/wiki/Monitoring#web-ui
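If you'd rather poke at this from a Rails console than the dashboard, the same data is exposed through Sidekiq's Ruby API (the queue name "default" is just an example):

require "sidekiq/api"

# jobs still waiting in a queue
Sidekiq::Queue.new("default").each do |job|
  puts "#{job.klass} #{job.args.inspect} enqueued_at=#{job.enqueued_at}"
end

# jobs waiting to retry, and jobs that exhausted their retries
puts Sidekiq::RetrySet.new.size
puts Sidekiq::DeadSet.new.size

# lifetime counters only (no per-job history, as noted above)
stats = Sidekiq::Stats.new
puts "processed=#{stats.processed} failed=#{stats.failed}"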
I'm running an EventMachine process on Heroku, and it seems to be hitting their memory limit of 512MB after an hour or so. I start seeing messages like this:
Error R14 (Memory quota exceeded)
Process running mem=531M(103.8%)
I'm running a lot of events through the reactor, so I'm thinking maybe the reactor is getting backed up (I'm imagining it as a big queue)? But there could be some other reason; I'm still fairly new to EventMachine.
Are there any good ways to profile EventMachine and get some stats on it? As a simple example, I was hoping to see how many events are scheduled in the queue, to check whether it's getting backed up and keeping them all in memory. But if anyone has other suggestions I'd really appreciate it.
Thanks!
I use EventMachine extensively and have never run into a memory leak inside the reactor, so my bet is that the leak is in your Ruby code; but without knowing more about your application it's hard to give you a real answer.
The only queue I can think of right now is the thread pool: each time you use the defer method, the block is either given to a free thread or queued while waiting for one. I suppose that if all your threads are blocked waiting for something, that queue could grow and use all the available memory.
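As far as I know there's no public counter for that defer backlog, but you can approximate one by wrapping EM.defer and logging the gap periodically, along with the process's memory. A rough sketch (the wrapper is my own, not part of EventMachine):

require "eventmachine"

$deferred  = 0
$completed = 0

# Route deferred work through this helper so we can count what's been handed to
# the thread pool versus what has finished.
def counted_defer(&work)
  $deferred += 1
  EM.defer(proc { work.call }, proc { |_result| $completed += 1 })
end

EM.run do
  EM.add_periodic_timer(10) do
    backlog = $deferred - $completed              # includes jobs currently running
    rss_kb  = `ps -o rss= -p #{Process.pid}`.to_i # resident memory in KB
    puts "defer backlog=#{backlog} rss=#{rss_kb / 1024}MB"
  end

  # ... rest of the reactor setup, using counted_defer { ... } instead of EM.defer ...
end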
The leak turned out to be in Mongoid's identity_map (nothing to do with EventMachine). Setting Mongoid.identity_map_enabled = false at the beginning of the EventMachine process resolved it.