Schedules and `max_failures` attribute - oracle

Can't get the max_failures idea. From the documentation:
This attribute specifies the number of times a job can fail on consecutive scheduled runs before it is automatically disabled.
So, let's suppose I have a schedule. Its running count is 100. Its failure count is 18. Its max failures is 20.
Current run has finished successfully.
I expect: if I will break it - it will run exactly 20 times on state FAILED after which it will be changed to BROKEN
What I get: it runs 2 times so failure count is 20 and despite the fact it were just 2 consecutive runs the schedule is changed to state BROKEN.
What have I missed?

I think "consecutive scheduled runs" means exactly that. If it succeeds, the failure count should be reset to 0.
EDIT
Guess I was wrong, sorry.
Reading up: http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/schedadmin004.htm
As per Gary's comment - looks like you need to reset the failure count manually.

Related

Laravel Task Scheduling - when does the execution starts

Laravel Task Scheduling has options like, everyMinute() which will run a command every minute. I want to know what is the time it actually starts executing. Is it when I run the server or any specific second of the minute?
I am trying to run a function depending on the time difference. Here's a pseudocode
Run myCustomFunction() if diffInMin(now, customTime) <= 1
I thought it will run 1 time but it ran twice every time.
The scheduler usually runs every minute right around the zero secound mark based on the server's current time as #apokryfos mentioned.
Assuming the customTime is a fixed DateTime, what makes you think the code you wrote will only run once?
When now() === customTime the diffInMin() would be zero so the
condition diffInMin(now, customTime) <= 1 will evaluate to true.
The next minute, the diffInMin() would be 1, so the
condition diffInMin(now, customTime) <= 1 will still evaluate to true.

PromQl: alert on first value of a counter

I have a prometheus counter (spring_batch_job_seconds_count{status=~'FAILED'}) that counts job failures. I want to graph job failures over time and alert on job failures. The increase function gives me what I want except for the first occurrence. The counter is not published until a failure occurs, so there is no increase (or delta or rate) on the first failure event since there is no previous counter value of 0 to compare the first non-zero counter value to. How can I create a graph that will show the first failure occurrence (as well as subsequent failure occurrences) and a corresponding alert that will trigger on the first failure occurrence (as well as future failure occurrences)? I might be willing to settle for two alerts: one that triggers when the counter increments, and one that triggers on the first occurrence, but I would not want to have to manually shut off the alert that triggers on the first occurrence after it triggers for the first time.
I managed to do this with falco metrics.
I want to alert on any change, even the first time a metric appears.
(sum(falco_events{k8s_pod_name="runner"} or falco_events{} * 0) by (k8s_pod_name, rule) - sum(falco_events{k8s_pod_name="runner"} offset 5m or falco_events{} * 0) by (k8s_pod_name, rule))
Workaround from here: https://github.com/prometheus/prometheus/issues/1673

What is "INTERVAL=0" means in Oracle Schedular?

My Oracle DBA have setup a task with following repeat_interval:
Start Date :"30/JAN/20 08:00AM"
Repeat_interval: "FREQ=DAILY; INTERVAL=0; BYMINUTE=15"
Can I ask what is "Interval=0" means?
Does it means this task will run daily from 8AM, and will repeat every 15 mins until success?
I tried to get the answer from Google, but what I find is what is Interval=1, but nothing for 0.
So would be great if anyone can share me some light here.
Thanks in advance!
INTERVAL is the number of increments of the FREQ value between executions. I believe in this case that a value of 0 or 1 would be the same. The schedule as shown would execute once per day (FREQ=DAILY), at approximately 15 minutes past a random hour (BYMINUTE=15, but BYHOUR and BYSECOND are not set).
Schedule has nothing to do with whether or not the previous execution succeeded or not. Start Date is only the date at which the job was enabled, not when it actually starts processing.
If you want it to run every 15 minutes from the moment you enable it, you should set as follows:
FREQ=MINUTELY; INTERVAL=15
If you want it to run exactly on the quarter hour, then this:
FREQ=MINUTELY; BYMINUTE=0,15,30,45; BYSECOND=0
If you want it to run every day at 8am, then this:
FREQ=DAILY; BYHOUR=8; BYMINUTE=0; BYSECOND=0

How windows recovery work with failure count?

I am using the following command to configure the service failure recovery
sc failure "service" actions= ""/60000/restart/60000/run/120000 reset= 60 command = "\"c:\\windows\notepad2.exe
(used notepad2.exe just for testing)
From the Microsoft documentation here:-
Actions
This field contains an array of integer values that specify the
actions taken by the SCM if the service fails. Separate the values in
the array by [~]. The integer value in the Nth element of the array
specifies the action performed when the service fails for the Nth
time.
So, what I am getting from this is the count of failure will decide the action => For first failure Actions[0] will be executed and for the second Actions[1] will be executed and for all subsequent failures Actions[2] will be
I have following configuration for the service for testing this behavior:-
Then I tried killing the process under which service is running by using taskkill.
Here is the first log
Then I tried starting the service manually.
Then again I tried killing the service after ~ 2 mins ( => the reset count will set failure count to 0 as it is configured to 1 minute).
Here is the log for the error
In above figure, it is clear that why count is resetting to 0 because reset setting we have given60 sec and our service was running more than 2 mins.
But the action described for recovery is wrong as Restarting the service is the action for the second failure not for the first failure.
So why the count for failure is coming 1 but the action for recovery is the action corresponding to the second failure action?
I was just playing around with a similar issue, and after I set the "Reset fail count after:" to "1" day, it seems to be working. A possible explanation is that by setting the "Reset fail count" to 1 day, it will not reset the fail count back to 0 after the first fail (which is you stopping and restarting the service manually), and lets it cycle through the rest of the actions (depending on conditions/actions). Your mileage may vary.

Quartz.NET: Need CronTrigger on an iStatefulJob instance to *delay instead of skip* if running job while schedule matures

Greetings, your friendly neighborhood Quartz.NET n00b is back!
I have a Windows Service running iStatefulJob instances on a Quartz.NET CronTrigger based schedule scheme... The CRON String used to schedule the job: "0 0/1 * * * ? *"
Everything works great. However, if I have a job that is set to run, say, at the X:00 mark of every minute, and that job happens to run for MORE than a minute, I notice that the subsequent job runs IMMEDIATELY after the job is finished executing, rather than waiting until its next scheduled run, effectively "queuing" up instead of merely skipping the job till it's next scheduled run.
I put in the trigger a CronTrigger MisfireInstruction of DONOTHING, but the exact same thing happens when a job overruns its next scheduled execution schedule.
How do I get an iStatefulJob instance to merely SKIP a scheduled execution trigger if it is currently running, rather than have it delay it until the first execution completes?
I explicitly set the trigger.MisfireInstruction = MisfireInstruction.CronTrigger.DoNothing;
...But instead of "doing nothing", for a job scheduled to run every minute that takes 90 seconds to complete, I experience the following execution log:
Job runs at 9:00:00am, finishes at 9:01:30am <- job runs for 1:30
Job runs at 9:01:30am, finishes at 9:03:00am <- subsequent job that should have run at 9:01:00
Job runs at 9:04:00am, finishes at 9:05:30am <- shouldn't this one have run at 9:03:00?
Job runs at 9:05:30am, finishes at 9:07:00am <- subsequent job that should have run at 9:05:00
Job runs at 9:08:00am, finishes at 9:09:30am <- shouldn't this have run at 9:07:00?
... it seems like it runs correctly the first time, on the minute... delays for 30 seconds as the 90 second job execution time expires, and then, instead of waiting till the NEXT full minute, EXECUTES IMMEDIATELY at the 30 second mark... Doubly odd, is that it then finishes the SECOND job on the minute mark, but waits till the NEXT minute mark to execute instead of running it back-2-back...
Pretty much seems like it works correctly EVERY OTHER RUN, when it is not running on the :30 marks...
What's the best way to get a job not to delay/queue, but to just SKIP until it is idle and the next schedule matures?
EDIT: I tried going back to iJobs instead of iStatefulJobs using the same DONOTHING trigger misfire instruction, but the job executes EVERY MINUTE despite the prior execution being still active. I can't seem to get it to skip a scheduled run if it is currently running with either iJob or iStatefulJob...
EDIT#2: I think that my triggers are NEVER misfiring, which is why DoNothing as a misfire instruction is useless... Given that's the case, I guess I need another mechanism to detect if a job instance of a schedule is running to ensure the job SKIPS its next execution until its following scheduled time rather than delaying it until first instance completion...
EDIT3: I tried adding an element to the iStatefulJob jobdatamap called "IsRunning"... I set it to TRUE when the execute sequence starts, and then return it to false after job completion. Before executing, it checks the element, which is apparently persisted between jobs, and prematurely quits the execution (logging "JOB SKIPPED!") if it detects it to be true... This unfortunately doesn't work, for probably obvious reasons: If the jobs are running following the bulleted schedule above, then the job is never SIMULTANEOUSLY running along with itself, as it is delaying the run till the job ends, so this check is useless. According to documentation, returning to iJob from iStatefulJob would not help here as the jobdatamap is only persisted between jobs in the Stateful job type...
I still haven't solved how to SKIP a scheduled job instead of delaying it till it's current iteration completes... If anyone has ideas, you're a lifesaver! :)
It should be caused by misfireThreshold of RAMJobStore (http://quartznet.sourceforge.net/apidoc/topic2722.html).
The time span by which a trigger must
have missed its next-fire-time, in
order for it to be considered
"misfired" and thus have its misfire
instruction applied.
It is 60 seconds by default. So job isn't considered as "misfired" until it is late for more than misfiredThreshold value.
To resolve the problem just decrease this threshold (below code sets to 1 ms):
...
properties["quartz.jobStore.misfireThreshold"] = "1";
...
schedulerFactory = new StdSchedulerFactory(properties);
It should resolve the issue.

Resources