I've run into a weird issue with Sidekiq: my queues are randomly emptied.
The last time it occurred, I had roughly 5,000 jobs in the queue, and the next minute they were all gone (but not done).
Is there some limitation I don't know about? Is this a configuration error on my side?
Thanks for helping
Related
In my system there are dead jobs, but our monitoring did not catch any unrescued Sidekiq errors. These jobs were scheduled to perform immediately. Is there a way to know why they went to the dead queue? I have run
job.error?
job.error_backtrace
but the first one returns false and the second one returns nil.
Is it possible to know why it moved to the dead queue? Is it possible there were too many jobs and it waited for a period of time before giving up? Thanks
We're running into an issue with the SOS-Berlin JobScheduler running on Windows that is difficult to diagnose* and I would appreciate any guidance.
*Difficult because I don't know Scala (though I do know C++ and Java), and it's difficult to navigate this codebase (some of it is in German).
We have a process class called Foo that will sometimes burst above the limit of how many processes can run. So, for example, we limit the process class to 30 processes and 60 want to run. This leaves 30 running and 30 "waiting for process."
The problem is that JobScheduler doesn't seem to prioritize the 30 that are waiting for a process. Instead, any new job that gets fired after the burst receives processes, leaving some jobs waiting indefinitely. Once the number of jobs "waiting for process" hits zero, the jobs clear out immediately.
Further, it seems that when there are a large number of jobs "waiting for process," the run time for tasks doubles or triples. A job that normally takes 20 seconds to run will spike to 1-2 minutes, further amplifying the issue as processes are not released back to the pool.
Admittedly, we're running an older version of JS, which we're planning to upgrade this or next week. However, I'm wondering if there is something fundamental we're missing. We've turned down the logging, looked for DB locks, added memory to the heap, and shut down some other processes on the server. We've also increased the process pool, but we don't want to push it too far, lest we crush the server. Nothing seems to be alleviating the issue.
Any tuning help would be appreciated!
As a follow-up, we determined the cause of the issue.
Another user had been using the temp directory to store intermediate generated files. The user was not clearing out these files, resulting in hundreds of thousands of files in the directory. They were not very large, so we didn't notice. For some reason JobScheduler started to choke on this; I'm not clear on the reasons.
Clearing the temp directory, scolding the user, and fixing his script fixed the issue.
I have a beanstalkd instance with two workers picking jobs from one tube.
I've noticed that occasionally one of the workers will reserve a job that has already been reserved (and is being worked on) by the other worker.
I know there aren't duplicate jobs in the queue.
Why does beanstalkd allow the same job to be reserved twice?
It sounds to me like you didn't implement the protocol properly. You need to handle DEADLINE_SOON and issue TOUCH.
What does DEADLINE_SOON mean?
DEADLINE_SOON is a response to a reserve command indicating that you have a job reserved whose deadline is real soon (current safety margin is approximately 1 second).
If you are frequently receiving DEADLINE_SOON errors on reserve, you should probably consider increasing the TTR on your jobs as it generally indicates you aren’t completing them in time. It may also be that you are failing to delete tasks when you have completed them.
See the mailing list discussion for more information.
How does TTR work?
TTR only applies to a job at the moment it becomes reserved. At that event, a timer (called “time-left” in the job stats) starts counting down from the job’s TTR.
If the timer reaches zero, the job gets put back in the ready queue.
If the job is buried, deleted, or released before the timer runs out, the timer ceases to exist.
If the job is touched (via the "touch" command) before the timer reaches zero, the timer starts over counting down from TTR.
The "touch" command
Allows a worker to request more time to work on a job.
This is useful for jobs that potentially take a long time, but you still want
the benefits of a TTR pulling a job away from an unresponsive worker. A worker
may periodically tell the server that it's still alive and processing a job
(e.g. it may do this on DEADLINE_SOON). The command postpones the auto
release of a reserved job until TTR seconds from when the command is issued.
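To make the TTR/touch interaction concrete, here is a minimal worker sketch using the PHP Pheanstalk client (assuming a Pheanstalk 3.x-style API; the "long-jobs" tube name and the chunked work loop are made up for the example). It reserves a job and touches it between chunks of work, so the TTR countdown keeps being reset and the job is never handed to another worker mid-run:

<?php
// Minimal worker sketch: keep a long-running job reserved by touching it
// before the TTR timer expires (assumes a Pheanstalk 3.x-style API).
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');

while (true) {
    // Block until a job is available on the (hypothetical) "long-jobs" tube.
    $job = $pheanstalk->watch('long-jobs')->ignore('default')->reserve();

    $steps = json_decode($job->getData(), true) ?: array();
    foreach ($steps as $step) {
        sleep(5);                  // stand-in for one slow chunk of real work
        $pheanstalk->touch($job);  // restart the TTR countdown for this job
    }

    // Delete only once the work is really done; if the worker crashes before
    // this point, the TTR will eventually hand the job to another worker.
    $pheanstalk->delete($job);
}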
The jobs were taking longer to run than the TTR, so each job was being returned to the queue and picked up by the other worker.
I now set a larger TTR on the job.
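For reference, a hedged sketch of how a larger TTR can be set when the job is enqueued through Pheanstalk (again assuming a 3.x-style API; the 600-second value, payload, and tube name are just illustrations):

<?php
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;
use Pheanstalk\PheanstalkInterface;

$pheanstalk = new Pheanstalk('127.0.0.1');

// Give the worker up to 10 minutes before beanstalkd re-queues the job.
$pheanstalk->useTube('long-jobs')->put(
    json_encode(array('task' => 'import', 'id' => 42)),
    PheanstalkInterface::DEFAULT_PRIORITY,
    PheanstalkInterface::DEFAULT_DELAY,
    600 // TTR in seconds
);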
I need asynchronous, quick processing of everything in the queue. Jobs consist of cURL requests, so it takes forever doing them one by one (they're basically the same as sleep(3)). I'd like all messages in the queue to run at the same time, or at least set a limit like 50. The reason I'm using a queue for this and not just running them instantly is that I need to make sure that if anything fails, it tries again.
Use the queue with iron.io IronMQ push queues; the queue shouldn't fail, but in the unlikely event that it does, there is a log.
See this link for reference http://blog.iron.io/2013/05/laravel-4-ironmq-push-queues-insane.html
From memory, you get 10 million requests free per month with IronMQ.
I have set up queues in Laravel for my processing scripts.
I am using beanstalkd and supervisord.
There are 6 different tubes for different types of processing.
The issue is that for each tube, artisan is constantly spawning workers every second.
The worker code seems to sleep for 1 second, and then the worker thread uses 7-15% CPU. Multiply this by 6 tubes (and I would like to have multiple workers per tube) and my CPU is being eaten up.
I tried changing the 1 second sleep to 10 seconds.
This helps but there is still a huge cpu spike every 10 seconds when the workers wake back up.
I am not even processing anything at this time because the queues are completely empty; it is simply the workers looking for something to do.
I also tested the CPU usage of Laravel when I refreshed the page in a browser, and that was hovering around 10%. I am on a low-end Rackspace instance right now, so that could explain it, but still... it seems like the workers spin up a Laravel instance every time they wake up.
Is there no way to solve this? Do I just have to put a lot of money into a more expensive server just to be able to listen to see if a job is ready?
EDIT:
Found a solution... it was to NOT use artisan queue:listen or queue:work.
I looked into the queue code and there doesn't seem to be a way around this issue; it requires Laravel to load every time a worker checks for more work to do.
Instead I wrote my own listener using pheanstalk.
I am still using Laravel to push things into the queue; my custom listener then parses the queue data and triggers an artisan command to run.
Now the CPU usage for my listeners is essentially 0%. The only time my CPU shoots up is when the listener actually finds work to do and triggers the command, and I am fine with that.
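For anyone curious, a rough sketch of that kind of standalone listener (assuming a Pheanstalk 3.x-style client, a made-up "processing" tube and payload format, and a purely hypothetical process:item artisan command):

<?php
// Standalone listener sketch: block on beanstalkd directly and only boot
// Laravel (via an artisan call) when there is actually a job to run.
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');

while (true) {
    // Blocking reserve: no polling, near-zero CPU while the tube is empty.
    $job = $pheanstalk->watch('processing')->ignore('default')->reserve();

    $payload = json_decode($job->getData(), true);
    $id = isset($payload['id']) ? $payload['id'] : '';

    // Hand the real work to an artisan command (hypothetical name/signature).
    exec('php artisan process:item ' . escapeshellarg($id), $output, $exitCode);

    if ($exitCode === 0) {
        $pheanstalk->delete($job);
    } else {
        // Put the job back so it can be retried later.
        $pheanstalk->release($job);
    }
}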
The problem of high CPU is caused by the worker loading the complete framework every time it checks for a job in the queue. In Laravel 4.2, you can use php artisan queue:work --daemon. This will load the framework once, and the checking/processing of jobs happens inside a while loop, which lets the CPU breathe easy. You can find more about the daemon worker in the official documentation: http://laravel.com/docs/queues#daemon-queue-worker.
However, this benefit comes with a drawback: you need special care when deploying the code, and you have to take care of the database connections, since long-running database connections are usually disconnected.
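A daemon worker is typically kept alive with supervisord; a minimal sketch of a program entry might look like the following (the paths, user, and option values are assumptions to adapt to your own setup):

[program:laravel-queue-worker]
command=php /var/www/app/artisan queue:work --daemon --sleep=3 --tries=3
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/log/laravel-worker.log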
I had the same issue.
But I found another solution. I used the artisan worker as is, but I modified the 'watch' time. By default (from Laravel) this time is hardcoded to zero; I changed this value to 600 (seconds). See the file:
'vendor/laravel/framework/src/Illuminate/Queue/BeanstalkdQueue.php'
and in function
'public function pop($queue = null)'
So now the worker is also listening on the queue for up to 10 minutes. When it does not get a job, it exits, and supervisord restarts it. When it receives a job, it executes it, then exits, and supervisord restarts it.
==> No polling anymore!
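For reference, the patched method ends up looking roughly like this (a sketch based on the Laravel 4.2 Beanstalkd driver; the exact surrounding code differs between releases, so treat it as illustrative only):

public function pop($queue = null)
{
    $queue = $this->getQueue($queue);

    // Originally reserve(0): return immediately when the tube is empty.
    // reserve(600) instead blocks on beanstalkd for up to 10 minutes, so the
    // worker is not re-spawned (and Laravel not re-booted) every second.
    $job = $this->pheanstalk->watchOnly($queue)->reserve(600);

    if ($job instanceof PheanstalkJob)
    {
        return new BeanstalkdJob($this->container, $this->pheanstalk, $job);
    }
}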
Notes:
It does not work for iron.io queues or others.
It might not work when you want one worker to accept jobs from more than one queue.
According to this commit, you can now pass a new option to queue:listen, "--sleep={int}", which lets you fine-tune how long to wait before polling for new jobs.
Also, the default has been set to 3 instead of 1.
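For example, to check for new jobs only every 10 seconds (the value is just an illustration):

php artisan queue:listen --sleep=10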