I've implemented long-running tasks in my Rails app using delayed_job along with delayed_job_web. My delayed_job configuration gives jobs a single attempt and retains failed jobs:
config/initializers/delayed_job.rb:
Delayed::Worker.max_attempts = 1
Delayed::Worker.destroy_failed_jobs = false
I ran 2 test jobs that deliberately raise errors, in order to see how failures behave. The counts the web interface then reports are not what I expected:
My expectation was that Failed would have a count of 2, but that Enqueued / Working / Pending would all be 0. I can't find any documentation on what determines whether a job counts as Enqueued / Working / Pending, or even what the difference between Working and Pending is (the web interface describes both lists as containing "jobs currently being processed").
Can anyone provide some clarity?
If you check https://github.com/ejschmitt/delayed_job_web/blob/master/lib/delayed_job_web/application/app.rb , you see the following (starting at line 114; it is a case statement over the status type):

case type
when :working
  'locked_at is not null'
when :failed
  'last_error is not null'
when :pending
  'attempts = 0'
end
Enqueued is the total number of delayed jobs, i.e. Delayed::Job.count.
Working jobs are those that have been locked by a delayed_job worker and are currently being processed.
Failed jobs are those that have a last_error.
Pending jobs are those that have never been attempted (attempts = 0).
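Putting that together, the four counts correspond roughly to these ActiveRecord queries (a sketch; the gem builds equivalent SQL strings as quoted above):

enqueued = Delayed::Job.count
working  = Delayed::Job.where('locked_at IS NOT NULL').count
failed   = Delayed::Job.where('last_error IS NOT NULL').count
pending  = Delayed::Job.where(attempts: 0).count

Note that the categories overlap: with destroy_failed_jobs = false, a failed job keeps its row, so it still counts toward Enqueued, and a job that has been attempted and failed is no longer Pending.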
I am using the following command to configure service failure recovery:

sc failure "service" actions= ""/60000/restart/60000/run/120000 reset= 60 command= "\"c:\windows\notepad2.exe\""

(I used notepad2.exe just for testing.)
From the Microsoft documentation:

Actions
This field contains an array of integer values that specify the actions taken by the SCM if the service fails. Separate the values in the array by [~]. The integer value in the Nth element of the array specifies the action performed when the service fails for the Nth time.
So, what I take from this is that the failure count decides the action: for the first failure Actions[0] is executed, for the second Actions[1], and for all subsequent failures Actions[2] (the last action) is executed, as the sketch below illustrates.
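As a toy model of that indexing (a Ruby sketch purely to restate the documented rule; the real logic lives inside the SCM):

# actions= ""/60000/restart/60000/run/120000 defines three actions:
ACTIONS = [nil, :restart, :run] # for the 1st, 2nd, and 3rd+ failure

# the Nth failure triggers Actions[N-1]; beyond the end, the last entry repeats
def action_for(nth_failure)
  ACTIONS[[nth_failure, ACTIONS.length].min - 1]
end

action_for(1) # => nil (no action)
action_for(2) # => :restart
action_for(7) # => :run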
I configured the service accordingly in order to test this behavior.
Then I killed the process hosting the service using taskkill; the event log recorded the first failure.
Then I started the service manually.
Then I killed the service again after roughly 2 minutes (by which point the reset interval, configured to 60 seconds, should have set the failure count back to 0); the event log again recorded the failure.
From that log it is clear why the count resets to 0: the reset setting is 60 seconds and the service had been running for more than 2 minutes.
But the recovery action reported is wrong: restarting the service is configured as the action for the second failure, not the first.
So why is the failure count reported as 1, while the recovery action taken is the one configured for the second failure?
I was just playing around with a similar issue, and after I set the "Reset fail count after:" to "1" day, it seems to be working. A possible explanation is that by setting the "Reset fail count" to 1 day, it will not reset the fail count back to 0 after the first fail (which is you stopping and restarting the service manually), and lets it cycle through the rest of the actions (depending on conditions/actions). Your mileage may vary.
I built a small web crawler implemented as two Sidekiq workers: Crawler and Parser. The Crawler worker seeks out links while the Parser worker reads the page body.
I want to trigger an alert when the crawling/parsing of all pages is complete. Monitoring only the Crawler job is not enough, since it may have finished while several Parser jobs are still running.
Looking at the sidekiq-status gem, it seems I cannot dynamically add new jobs to a container for monitoring. E.g. it would be nice to have an "add" method in the following context:
@container = SidekiqStatus::Container.new
# ... for each page url found:
jid = ParserWorker.perform_async(page_url)
@container.add(jid)
The closest to this is to use SidekiqStatus::Container.load or SidekiqStatus::Container.load_multi; however, it is not possible to add new jobs to the container after the fact.
One solution would be to create as many SidekiqStatus::Container instances as there are Parser jobs and check whether all of them have status == "finished", but I wonder if a more elegant solution exists using these tools.
Any help is appreciated.
You are describing Sidekiq Pro's Batches feature exactly. You can spend a lot of time or some money to solve your problem.
https://github.com/mperham/sidekiq/wiki/Batches
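Roughly, a batch wraps the initial job, and a worker can reopen its own batch to add child jobs dynamically, so the success callback fires only after the crawler and every parser it spawned have finished. A sketch (requires Sidekiq Pro; the callback class, crawl helper, and root_url are illustrative):

class CrawlFinishedCallback
  # Sidekiq Pro calls this once every job in the batch has succeeded
  def on_success(status, options)
    # all crawling/parsing done: trigger the alert here
  end
end

batch = Sidekiq::Batch.new
batch.on(:success, CrawlFinishedCallback)
batch.jobs do
  CrawlerWorker.perform_async(root_url)
end

class CrawlerWorker
  include Sidekiq::Worker

  def perform(page_url)
    page_urls = crawl(page_url) # illustrative helper
    # reopen this job's own batch and add the parser jobs to it,
    # so they count toward the same batch
    Sidekiq::Batch.new(bid).jobs do
      page_urls.each { |url| ParserWorker.perform_async(url) }
    end
  end
end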
OK, here's a simple solution. Using the sidekiq-status gem, the Crawler worker keeps track of the job IDs of the Parser jobs it spawns and waits while any Parser job is still busy (using SidekiqStatus::Container instances to check job status).
def perform
  @jids = []
  # for each page....
  #   @jids << ParserWorker.perform_async(page_url)
  # end

  # crawler finished, but parsers may still be running
  while parsers_busy?
    sleep 5 # wait 5 secs between each check
  end

  # all parsers complete, trigger notification...
end

def parsers_busy?
  status_containers = SidekiqStatus::Container.load_multi(@jids)
  status_containers.any? do |container|
    container.status == 'waiting' || container.status == 'working'
  end
end
I have found resque-delayed:
https://github.com/elucid/resque-delayed
I can see that I can schedule delayed jobs with it. My question is: how does it check for delayed jobs? If I have 5000 delayed jobs spread over one month, I hope it doesn't check all of them every 10 seconds.
So how is this done?
It does not have to check all the delayed jobs. It maintains a sorted set in Redis, the jobs being sorted by their scheduled time. See the code at:
https://github.com/elucid/resque-delayed/blob/master/lib/resque-delayed/resque-delayed.rb
Each time the daemon awakes, only the first item of the set needs to be checked (using a ZRANGEBYSCORE command). The daemon fetches the relevant jobs one by one, until the polling query returns no result, then it sleeps again.
Performance could be further improved by fetching the jobs n at a time. This could be implemented using a server-side Lua script as the polling query:
local res = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, 10)
if #res > 0 then
  redis.call('ZREMRANGEBYRANK', KEYS[1], 0, #res - 1)
  return res
else
  return false
end
In one round trip, this script fetches up to 10 due jobs (if any are available) and deletes them from the zset. Much better than the 11 ZRANGEBYSCORE and 10 ZREM commands currently required by Resque-delayed.
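For example, a polling daemon written in Ruby could invoke that script like this (a sketch using the redis-rb gem; the zset key name and the enqueue_now helper are illustrative):

require 'redis'

POLL_SCRIPT = <<~LUA
  local res = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, 10)
  if #res > 0 then
    redis.call('ZREMRANGEBYRANK', KEYS[1], 0, #res - 1)
    return res
  else
    return false
  end
LUA

redis = Redis.new
sha = redis.script(:load, POLL_SCRIPT) # cache the script server-side

loop do
  due = redis.evalsha(sha, keys: ['resque:delayed'], argv: [Time.now.to_i])
  break unless due # Lua false comes back as nil: nothing due, sleep again
  due.each { |payload| enqueue_now(payload) } # enqueue_now is illustrative
end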
I have this in my initializer:
Delayed::Job.const_set( "MAX_ATTEMPTS", 1 )
However, my jobs are still re-running after failure, seemingly completely ignoring this setting.
What might be going on?
more info
Here's what I'm observing: jobs with a populated "last error" field and an "attempts" number of more than 1 (10+).
I've discovered I was reading the old/wrong wiki. The correct way to set this is
Delayed::Worker.max_attempts = 1
Check your database table delayed_jobs for records (jobs) that still exist after the job "fails". The job will be re-run if the record is still there. If it shows that attempts is non-zero, then you know your constant setting isn't working.
Another guess is that the job's "failure", for some reason, is not being caught by DelayedJob. In that case, attempts would still be 0.
Debug by examining the delayed_job/lib/delayed/job.rb file, especially the self.workoff method, when one of your jobs "fails".
Added for @John: I don't use MAX_ATTEMPTS. To debug, look in the gem to see where it is used. It sounds like the job is being handled in the normal way rather than limiting attempts to 1. Use the debugger or a logging statement to ensure that your MAX_ATTEMPTS setting is getting through.
Remember that the DelayedJob jobs runner is not a full Rails program, so it could be that your initializer is not being run. Look into the script you're using to start the jobs runner.
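For example, one way to confirm the setting actually reached the worker process (a sketch; SettingsProbeJob is an illustrative name):

# Enqueue this from a console, then check the worker's log output:
class SettingsProbeJob
  def perform
    Rails.logger.info "Delayed::Worker.max_attempts = #{Delayed::Worker.max_attempts}"
  end
end

Delayed::Job.enqueue(SettingsProbeJob.new)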
I can't get my head around the max_failures idea. From the documentation:
This attribute specifies the number of times a job can fail on consecutive scheduled runs before it is automatically disabled.
So, suppose I have a schedule. Its run count is 100, its failure count is 18, and its max_failures is 20.
The current run has finished successfully.
What I expect: if I now break it, it will fail exactly 20 consecutive times (state FAILED), after which it will be changed to BROKEN.
What I get: it fails just 2 times, so the failure count reaches 20, and despite there having been only 2 consecutive failed runs, the schedule is changed to state BROKEN.
What have I missed?
I think "consecutive scheduled runs" means exactly that. If it succeeds, the failure count should be reset to 0.
EDIT
Guess I was wrong, sorry.
Reading up: http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/schedadmin004.htm
As per Gary's comment, it looks like you need to reset the failure count manually.