Handling a System Restart/Crash in between changing a Flag - algorithm

I have the following scenario:
1. Change the Flag to Start (in the database)
2. Do some processing
3. Update the Flag back to Finished (in the database)
Suppose the system crashes during step 2. Ideally I would want to set the Flag back to Finished, but because of the crash that never happens, and the task ends up deadlocked.
What are the standard solutions/approaches/algorithms used to address such a scenario?
Edit: how does the deadlock occur?
The task will be picked up only if Flag = Finished; Flag = Start means it is in progress, in the middle of something. So when there is a crash, the task is not complete, but the Flag is also not set back to Finished the next time the system runs, so the task is never picked up again.

I don't see any simple solution here.
If your tasks' execution time is predictable enough, you can store a timestamp of when execution started in your DB and return the task to its "empty" (not yet started) state once a timeout expires.
Or you can store the process ID in your DB and implement a supervisor process that launches your "executor" processes and checks their exit codes. If a process crashes, the supervisor "reinitialises" all tasks marked with the crashed process ID.
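For illustration, here is a minimal sketch of the first (timestamp/timeout) approach, using SQLite and an invented tasks table; the column names and the 30-minute limit are only placeholders:
import sqlite3
import time

STALE_AFTER = 30 * 60  # seconds a task may stay in 'Start' before we assume a crash

conn = sqlite3.connect("tasks.db")
conn.execute("CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, status TEXT, started_at REAL)")

def claim_task(task_id):
    # Mark the task as started and record when, so a crash leaves evidence behind
    conn.execute(
        "UPDATE tasks SET status = 'Start', started_at = ? WHERE id = ? AND status = 'Finished'",
        (time.time(), task_id),
    )
    conn.commit()

def release_stale_tasks():
    # Run periodically (or at startup): tasks stuck in 'Start' past the timeout
    # are returned to 'Finished' so the normal poller will pick them up again
    conn.execute(
        "UPDATE tasks SET status = 'Finished', started_at = NULL "
        "WHERE status = 'Start' AND started_at < ?",
        (time.time() - STALE_AFTER,),
    )
    conn.commit()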

Related

Trains: Can I reset the status of a task? (from 'Aborted' back to 'Running')

I had to stop training in the middle, which set the Trains status to Aborted.
Later I continued it from the last checkpoint, but the status remained Aborted.
Furthermore, automatic training metrics stopped appearing in the dashboard (though custom metrics still do).
Can I reset the status back to Running and make Trains log training stats again?
Edit: When continuing training, I retrieved the task using Task.get_task() and not Task.init(). Maybe that's why training stats are not updated anymore?
Edit2: I also tried Task.init(reuse_last_task_id=original_task_id_string), but it just creates a new task, and doesn't reuse the given task ID.
Disclaimer: I'm a member of the Allegro Trains team
When continuing training, I retrieved the task using Task.get_task() and not Task.init(). Maybe that's why training stats are not updated anymore?
Yes, that's the only way to continue the exact same Task.
You can also mark it as started with task.mark_started(). That said, the automatic logging will not kick in, as Task.get_task is usually used for accessing previously executed tasks, not for continuing them (if you think the continue use case is important, please feel free to open a GitHub issue; I can definitely see the value there).
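For reference, a minimal sketch of that route (the task id below is a placeholder for your aborted run's id):
from trains import Task

# Reopen the previously executed task and flip its status back from Aborted.
# Note: automatic framework logging does not resume on a task fetched this way.
previous_task = Task.get_task(task_id='your_task_id')  # placeholder id
previous_task.mark_started()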
You can also do something a bit different, and just create a new Task that continues from the last iteration the previous run ended at. Notice that if you load the weights file (PyTorch/TF/Keras/JobLib) it will automatically connect it with the model that was created in the previous run (assuming the model was stored in the same location, or the model is on https/S3/GS/Azure and you are using trains.StorageManager.get_local_copy()). For example:
from trains import Task
import torch

previous_run = Task.get_task(task_id='previous_task_id')  # placeholder: id of the aborted run
task = Task.init(project_name='examples', task_name='continue training')
task.set_initial_iteration(previous_run.get_last_iteration())
# loading the stored weights connects the model created in the previous run
model_state = torch.load('/tmp/my_previous_weights')
BTW:
I also tried Task.init(reuse_last_task_id=original_task_id_string), but it just creates a new task, and doesn't reuse the given task ID.
This is a great idea for an interface to continue a previous run; feel free to add it as a GitHub issue.

NiFi - Stop upon failure

I've been trying to Google and search Stack Overflow for the answer, but have been unable to find one.
Using NiFi, is it possible to stop a process upon previous job failure?
We have user data we need to process but the data is sequentially constructed so that if a job fails, we need to stop further jobs from running.
I understand we can create scripts to fail a process upon a previous process's failure, but what if I need the entire group to halt upon failure; is this possible? We don't want each job in the queue to follow the failure path; we want it to halt until we can look at the data and analyze the failure.
TL;DR: can we STOP a process upon a failure, rather than just funnel all remaining jobs into the failure flow? We want the data in the queues to wait until we fix the problem, i.e. stop the process, not just fail again and again.
Thanks for any feedback, cheers!
Edit: typos
You can configure backpressure on the queues to stop upstream processes. If you set the backpressure threshold to 1 on a failure queue, it would effectively stop the processor until you had a chance to address the failure.
Failure is often routed back into the same processor, but this is not required. What is important is that the next processor does not remove the FlowFiles from the failure queue, so the backpressure is maintained until you take action.
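If you would rather set the threshold programmatically than through the canvas UI, a rough sketch against the NiFi REST API could look like this (the base URL and connection id are placeholders, and the exact field names should be verified against your NiFi version):
import requests

NIFI = "http://localhost:8080/nifi-api"          # placeholder base URL
CONNECTION_ID = "your-failure-connection-uuid"   # placeholder connection id

# Read the current connection definition, including the revision we must echo back
entity = requests.get(f"{NIFI}/connections/{CONNECTION_ID}").json()

# An object threshold of 1 means a single FlowFile in the failure queue applies backpressure
entity["component"]["backPressureObjectThreshold"] = 1

requests.put(
    f"{NIFI}/connections/{CONNECTION_ID}",
    json={"revision": entity["revision"], "component": entity["component"]},
)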

Oracle streams - waiting for redo when the redo file is gone forever

I have streams configured, which stopped working after a while.
To resync, I stopped all capture/apply processes and exported the tables from source to target.
After starting up again, it still says waiting for that same redo file.
Is it possible to "restart" streams from a current file?
Do you want to "shift" your capture process ahead in time? You could try to switch your capture process to a certain SCN by invoking
DBMS_CAPTURE_ADM.ALTER_CAPTURE('YOUR_CAPTURE_NAME', start_scn => :SCN);
where :SCN is a valid system change number from which you want your messages to be captured. Take a look at the DBMS_CAPTURE_ADM documentation.

How to restart a program in terminal periodically?

I am calling a program, let's say myprogram, from the terminal (in OS X Mavericks), but sometimes it gets stuck due to external problems out of my control. This tends to happen approximately every half hour.
myprogram basically has to perform a large number of small subtasks, which are saved in a file that is read on every new execution, so there is no need to recompute everything from the beginning.
I would like to fully automate restarting the program by killing it and launching it again, in the following way:
1. Start the program.
2. Kill it after 30 minutes (the program will probably be stuck).
3. Restart it (go back to step 1).
Any ideas on how to do this? My knowledge of bash scripting is not exactly great...
The following script can serve as a wrapper for myprogram:
#!/bin/bash
while true              # begin an infinite loop (you'll have to kill it manually)
do
    ./myprogram &       # execute myprogram and put it in the background
    PID=$!              # get the PID of myprogram
    sleep 1800          # sleep 30 minutes ("sleep 30m" may also work)
    kill -9 "$PID"      # kill myprogram
done
You could use a wrapper, but an infinite loop is not an optimal solution. If you are looking to relaunch a program on a timer, or based on its exit code, and you are on OS X, you should use launchd configuration files (XML property lists) and load them with launchctl.
KeepAlive <boolean or dictionary of stuff>
This optional key is used to control whether your job is to be kept continuously running or to let
demand and conditions control the invocation. The default is false and therefore only demand will start
the job. The value may be set to true to unconditionally keep the job alive. Alternatively, a dictionary
of conditions may be specified to selectively control whether launchd keeps a job alive or not. If
multiple keys are provided, launchd ORs them, thus providing maximum flexibility to the job to refine
the logic and stall if necessary. If launchd finds no reason to restart the job, it falls back on
demand based invocation. Jobs that exit quickly and frequently when configured to be kept alive will
be throttled to conserve system resources.
SuccessfulExit <boolean>
If true, the job will be restarted as long as the program exits and with an exit status of zero.
If false, the job will be restarted in the inverse condition. This key implies that "RunAtLoad"
is set to true, since the job needs to run at least once before we can get an exit status.
...
ExitTimeOut <integer>
The amount of time launchd waits before sending a SIGKILL signal. The default value is 20 seconds. The
value zero is interpreted as infinity.
For more information on launchd & plists, visit:
https://developer.apple.com/library/mac/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingLaunchdJobs.html
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man5/launchd.plist.5.html
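As a rough illustration (the label, program path, and values are placeholders), the kind of job the man page excerpt above describes could be written out with Python's plistlib and then loaded with launchctl:
import os
import plistlib

job = {
    "Label": "com.example.myprogram",                  # placeholder label
    "ProgramArguments": ["/usr/local/bin/myprogram"],  # placeholder path to the program
    "RunAtLoad": True,
    "KeepAlive": {"SuccessfulExit": False},            # restart whenever it exits non-zero
    "ExitTimeOut": 20,                                 # seconds before launchd sends SIGKILL on stop
}

plist_path = os.path.expanduser("~/Library/LaunchAgents/com.example.myprogram.plist")
with open(plist_path, "wb") as fp:
    plistlib.dump(job, fp)

# then: launchctl load ~/Library/LaunchAgents/com.example.myprogram.plist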

Attach gdb to process before I know the process id

I am debugging a process on a web server running Linux. The process is invoked once a request comes in from a web page. In order to debug the process, I look at the running processes list (using top), spot the relevant process (named apache2) by its CPU usage (quite easy, since it is usually at the top of the list), and attach the gdb session to its process id. Of course I can call the attach PID command only after the process is up.
The only problem is that spotting the process id takes a second or two, so I cannot stop at functions that are called during the first second or two. (The whole process takes about a minute, so in most cases it is not a problem.)
Is there any way of doing this automatically, so I can save these couple of seconds and start the attachment earlier?
You can attach to the parent process and catch forks. Don't forget to set follow-fork-mode child.
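For example, something along these lines (the breakpoint name is a placeholder, and this assumes the oldest apache2 process is the parent):
import subprocess

# Find the Apache parent before any request arrives (pgrep -o returns the oldest match)
parent_pid = subprocess.check_output(["pgrep", "-o", "apache2"]).decode().strip()

subprocess.run([
    "sudo", "gdb",
    "-p", parent_pid,                     # attach to the parent, not a worker
    "-ex", "set follow-fork-mode child",  # switch to the forked child that serves the request
    "-ex", "break my_request_handler",    # placeholder for the function of interest
    "-ex", "continue",
])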
