Failed Service Task state after application reboot - spring-boot

What happens to a failed Service Task when the application is restarted? Will it attempt to retry the Service Task once again?
If not, how does the process proceed? Is there a way to manually retry the Service Task?
I can see from the Flowable API that the taskQuery lists only the User Tasks.

In Flowable, when an execution fails (due to some exception), the transaction is rolled back to the previous wait state in your workflow. This means that the next time it runs, it will continue from there.
You can make a Service Task async, which makes it a wait state. In that case an Async Job is created and retried (by default 3 times, but that is configurable). If the Service Task keeps failing for the configured number of retries, the job is moved to the dead letter jobs, and you then need to trigger it manually.
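For illustration, a minimal sketch of manually retrying dead letter jobs through the Flowable ManagementService. It assumes a Flowable 6 engine is available; the helper class and the retry count of 3 are illustrative:

```java
import java.util.List;

import org.flowable.engine.ManagementService;
import org.flowable.engine.ProcessEngine;
import org.flowable.job.api.Job;

public class DeadLetterJobRetry {

    // Move every dead letter job of a process instance back to the executable jobs,
    // so the async executor picks them up again with a fresh number of retries.
    public static void retryDeadLetterJobs(ProcessEngine processEngine, String processInstanceId) {
        ManagementService managementService = processEngine.getManagementService();
        List<Job> deadLetterJobs = managementService.createDeadLetterJobQuery()
                .processInstanceId(processInstanceId)
                .list();
        for (Job job : deadLetterJobs) {
            managementService.moveDeadLetterJobToExecutableJob(job.getId(), 3);
        }
    }
}
```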

Related

How to perform clean ups on spring cloud task completion

I am writing an SCDF SPI implementation to support stream and task applications. As part of this, we need to perform some cleanup operations once the task finishes.
Could someone provide info on whether SCDF gets a callback on task completion? If not, what are the alternative ways to perform cleanup?
A Task is a short-lived and finite operation. Depending on what you're trying to accomplish, you can do one of the following to invoke any custom cleanup routine.
1) Run the task as a batch job; in that job you can define "n" steps as part of a workflow, and upon success of the upstream steps, the last step can invoke the cleanup routine (see the sketch after this list).
2) You can have a stream in SCDF listening to task-complete events (a batch-job example here), which can finally kick off another task/job to invoke the cleanup routine.
3) You can define a composed-task graph (via Dashboard/shell) where each of the steps (aka tasks) runs its intended operation, and upon a successful transition or a failure event, you get the opportunity to kick off the cleanup routine.
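For option 1), a rough sketch of what such a batch job could look like using Spring Batch's builder factories; the bean names, the work step, and the cleanup logic are all placeholders:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class CleanupJobConfig {

    // One of the "n" upstream steps of the workflow (placeholder).
    @Bean
    public Step workStep(StepBuilderFactory steps) {
        return steps.get("workStep")
                .tasklet((contribution, chunkContext) -> {
                    // the task's actual work goes here
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    // Last step: runs only after the upstream steps complete successfully.
    @Bean
    public Step cleanupStep(StepBuilderFactory steps) {
        return steps.get("cleanupStep")
                .tasklet((contribution, chunkContext) -> {
                    // invoke the custom cleanup routine here
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    @Bean
    public Job taskJob(JobBuilderFactory jobs, Step workStep, Step cleanupStep) {
        return jobs.get("taskJob")
                .start(workStep)
                .next(cleanupStep)  // cleanup runs last
                .build();
    }
}
```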

Configure Marathon to not restart tasks that enter TASK_FINISHED

I launch tasks via Marathon. However they finish, Marathon restarts them. I would like them to restart only if they finish in failure. Is there a way to make Marathon not restart a task that enters the TaskStatus.TASK_FINISHED state, e.g. by suspending the job, i.e. by setting the number of tasks to zero?
Currently, when my task completes successfully, I PUT a request to the Marathon REST API scaling the job down to 0 instances. This works, except that in response Marathon kills the task and sets its status to TASK_KILLED, while I would like it to be TASK_FINISHED to indicate its success.
If you have one-off tasks as you describe, I think the better solution would be to use a scheduler like
https://mesos.github.io/chronos/ or its successor
https://github.com/dcos/metronome
Marathon is normally used to keep tasks running, rescheduling them if they reach a terminal task state.
See the Marathon docs, and also this explanation of different task types.

ensuring that a mesos task is not running after a TASK_LOST status update

I am trying to write a simple Mesos framework that can relaunch tasks that don't succeed.
The basic algorithm, which seems to be mostly working, is to read in a task list (e.g. shell commands), launch executors, and wait for status messages. If I get TASK_FINISHED, that particular task is done. If I get TASK_FAILED/TASK_KILLED, I can retry the task elsewhere (or maybe give up).
The case I'm not sure about is TASK_LOST (or even slave lost). I am hoping to ensure that I don't launch another copy of a task that is already running. After getting TASK_LOST, is it possible that the executor is still running somewhere, but a network problem has disconnected the slave from the master? Does Mesos deal with this case somehow, perhaps by having the executor kill itself (and the task) when it is unable to contact the master?
More generally, how can I make sure I don't have two copies of the same task running in this context?
Let me provide some background first and then try to answer your question.
1) The difference between TASK_LOST and the other terminal unsuccessful states is that restarting a lost task could end in TASK_FINISHED, while restarting a failed or killed task most probably will not.
2) Until you get a TASK_LOST you should assume your task is running. Imagine a Mesos Agent (Slave) dies for a while; the tasks on it may still be running and will be successfully reconciled once the connection is restored, even though it was temporarily lost.
3) Now to your original question. The problem is that it is extremely hard to have exactly one instance running (see e.g. [1] and [2]). If you have lost the connection to your task, that can mean either a (temporary) network partition or that your task has died. You basically have to choose between two alternatives: either allowing multiple instances to run at the same time, or accepting periods when no instance is running.
4) It is not easy to guarantee that two tasks are not running concurrently. When you get a TASK_LOST update from Mesos, it means your task is either dead or orphaned (it will be killed once reconciled). Now imagine a situation where a slave running your task is disconnected from the Mesos Master (due to a network partition): while you will get a TASK_LOST update and the Master ensures the task is killed once the connection is restored, your task will keep running on the disconnected slave until then, which violates the guarantee, given that you have already started another instance after getting the TASK_LOST update (see the status-handling sketch after the references below).
5) Things you may want to look at:
recovery_timeout on Mesos slaves regulates when tasks commit suicide if the mesos-slave process dies
slave_reregister_timeout on the Mesos Master specifies how much time slaves have to reregister with the Mesos Master and have their tasks reconciled (basically, when you get TASK_LOST updates for unreachable tasks).
[1] http://antirez.com/news/78
[2] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
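To make point 4) concrete, here is a rough sketch of how a framework's status handler might distinguish the terminal states, using the Mesos Java bindings. The retry bookkeeping is left as comments and the class itself is purely illustrative:

```java
import java.util.Collections;

import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

public class StatusHandling {

    // Called from the scheduler's statusUpdate() callback.
    public void handleStatus(SchedulerDriver driver, TaskStatus status) {
        TaskState state = status.getState();
        if (state == TaskState.TASK_FINISHED) {
            // Done; nothing to retry.
        } else if (state == TaskState.TASK_FAILED || state == TaskState.TASK_KILLED) {
            // Terminal and unsuccessful: safe to schedule a retry elsewhere (or give up).
        } else if (state == TaskState.TASK_LOST) {
            // Ask the master for its view of this task before relaunching; the answer
            // arrives as another status update. Be aware the old instance may still be
            // running on a partitioned slave until it is reconciled.
            driver.reconcileTasks(Collections.singletonList(status));
        }
    }
}
```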
You can assume that TASK_LOST really means your task is lost and there is nothing you can do but launch another instance.
Two things to keep in mind though:
Your framework may register with a failover timeout, which means that if your framework cannot communicate with the slave for any reason (unstable network, the slave died, the scheduler died, etc.), Mesos will kill the tasks of this framework after they fail to recover within that timeout. You will get a TASK_LOST status once the task is actually considered dead (e.g. when the failover timeout expires).
When not using a failover timeout, tasks will be killed immediately when connectivity is lost for any reason.
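For reference, a minimal sketch of registering with a failover timeout through the Mesos Java bindings; the framework name, user, and timeout value are illustrative:

```java
import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos.FrameworkInfo;
import org.apache.mesos.Scheduler;

public class DriverFactory {

    public static MesosSchedulerDriver createDriver(Scheduler scheduler, String masterAddress) {
        FrameworkInfo framework = FrameworkInfo.newBuilder()
                .setUser("")                          // empty user lets Mesos pick the current user
                .setName("retrying-framework")
                .setFailoverTimeout(7 * 24 * 3600.0)  // keep tasks around while the scheduler is away
                .build();
        return new MesosSchedulerDriver(scheduler, framework, masterAddress);
    }
}
```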

Resque popping instead of failing

I have some simple code for Ruby Resque.
But due to some code issues, some jobs fail.
The problem is that they do not appear in the failed jobs; they just disappear (once a job is taken from the queue, it is popped, so it is removed).
How can I make Resque put the job into the failed jobs?
I see the problem. Once the job is taken from the queue, it is removed from Redis. Once it fails, it is put back into Redis in the failed jobs. But the moment you stop the worker (Ctrl+C), it doesn't have time to put the job back into the failed jobs.

Windows Services Recovery not restarting service

I configured recovery for my Windows service to restart it with a one minute delay after failures, but I have never gotten it to actually restart the service (even with the most blatant errors).
I do get a message in the EventViewer:
The description for Event ID ( 1 ) in Source ( MyApp.exe ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: Access violation at address 00429874 in module 'MyApp.exe'. Write of address 00456704.
Is there something else I have to do? Is there something in my code (I use Delphi) which needs to be set to enable this?
Service Recovery is intended to handle the case where a service crashes - so if you go to taskmgr and right click "end process" on your service process, the recovery logic should kick in. I don't believe that the service recovery logic kicks in if your service exits gracefully (even if it exits with an error).
Also the eventvwr message indicates that your application called the ReportEvent API specifying event ID 1. But you haven't registered your event messages with the event viewer so it can't convert event ID 1 into a meaningful text string.
Service recovery only works for an unexpected exit, such as an exit(-1) call.
Stopping the service in any of the usual ways will not trigger recovery.
If you want to stop the service and still have recovery work, call exit(-1); you will see an error message like "service stopped with unexpected error", and then your service will restart according to the recovery settings.
The Service Control Manager will attempt to restart your service if you've set it up to be restarted by the SCM. This is detailed here in the documentation for the SERVICE_FAILURE_ACTIONS structure.
A service is considered failed when it terminates without reporting a status of SERVICE_STOPPED to the service controller.
This can be fine-tuned by setting the SERVICE_FAILURE_ACTIONS_FLAG structure's fFailureActionsOnNonCrashFailures flag (see here). You can set this from the Services applet by checking the "Enable actions for stops with errors" checkbox on the Recovery tab.
If this member is TRUE and the service has configured failure actions, the failure actions are queued if the service process terminates without reporting a status of SERVICE_STOPPED or if it enters the SERVICE_STOPPED state but the dwWin32ExitCode member of the SERVICE_STATUS structure is not ERROR_SUCCESS (0).
If this member is FALSE and the service has configured failure actions, the failure actions are queued only if the service terminates without reporting a status of SERVICE_STOPPED.
So, depending on how you have structured your service, how you have configured your failure actions, AND what you do when you hit your 'fatal error', it may be enough to call ExitProcess() or exit() and return a non-zero value. However, it's probably safest to ensure that your service exits without the code that's dealing with the SCM telling the SCM that your service has reached the SERVICE_STOPPED state. This ensures that your failure actions ALWAYS happen...
If you 'kill' the service from Task Manager, forget about the recovery logic. In the background, Task Manager 'kills' the process by stopping the service, and as you can guess, that is not a service failure. This forced me to really kill it with Visual Studio: in Task Manager, right-click the service process and select Debug.
In Visual Studio, select Debug -> Terminate All.
Now you have simulated a service failure, and in this case the recovery logic works fine.

Resources