Open Grid Scheduler/Sun Grid Engine qrsh bad exit code on halt/reboot

I use OGS on spot instances through qrsh calls. To have my program work properly, I need to be able to know when a job has failed due to a system shutdown (me losing the spot instance).
If we execute a remote command via ssh and the remote system goes down, the returned exit code is 255.
My problem is that with OGS, when executing a remote command with qrsh and the remote system goes down, the returned exit code is 0. 0 is meant for "ok, all good". So there is no way to know from that code that no, it's not ok and I need to reschedule.
(Of course, I could change my remote calls to return a specific code, but since it's not standard I would rather avoid that.)
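For reference, the workaround mentioned in the last sentence could look roughly like the Ruby sketch below. It is only an illustration, not a fix for qrsh itself: the sentinel value 17 and the helper names are made up, and it assumes that qrsh passes the remote command's exit status through when the host stays up. The remote command is wrapped so that success is reported with the sentinel instead of 0, which leaves a bare 0 free to mean "the host probably went away".

```ruby
# Hypothetical sentinel: any value the wrapped command never uses on its own.
SENTINEL_OK = 17

# Wrap a shell command so that success is reported as SENTINEL_OK, not 0.
# A failing command keeps its own non-zero exit status.
def wrap_for_sentinel(cmd)
  "( #{cmd} ) && exit #{SENTINEL_OK}"
end

# Interpret the exit status seen by the caller.
def classify_exit(code)
  case code
  when SENTINEL_OK then :ok      # command ran to completion
  when 0           then :lost    # 0 without the sentinel: reschedule
  else                  :failed  # genuine failure of the command
  end
end
```

The wrapped string would be what gets handed to qrsh; the caller then reschedules whenever classify_exit returns :lost.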

Related

How can I handle a `system` call which fails?

I have a Perl script which calls an external program. (Right now I'm actually using backticks, but I could just as easily use system or something from CPAN.) Sometimes the program fails, causing Windows to show a dialog box "(external program) has stopped working" with the text
Windows is checking for a solution to the problem...
shortly replaced with
A problem caused the program to stop working correctly. Windows will close the program and notify you if a solution is available.
Unfortunately, this error message stops the process from dying, causing Perl to not return until the user (me!) clicks "Cancel" or "Close Program". Is there a way to avoid this behavior?
In my use case it is acceptable to have the program fail -- it does useful but strictly not necessary work. But as it needs to run unattended I can't have it block the program's remaining work.

The problem with your current approach is that backticks and system block while the external program is running or hanging. Other possible approaches include:
Using threads and various modules from the Win32 family to busy-wait for the process to end, or to click on the dialog box. This is probably overkill.
Using an alarm signal or event to wake up your program when the external program has taken 'too long' to respond.
Using an IPC module to open the program and monitor its progress.
If you don't need the child program's return value, STDOUT or STDERR, simbabque's exec option has merit, but if you need to keep a handle on the process, try Win32::Process. I've found this useful on many an occasion. The module's Wait method can be an excellent alternative to my alarm suggestion or simbabque's sleep suggestion, with the added benefit that your program will not sleep longer than required by the child.
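The "wait with a time limit, then give up" idea can be sketched as follows. This is written in Ruby rather than Perl and uses only the standard library, so it is an analogue of the Win32::Process approach, not that module's API. The helper name run_with_timeout is made up for the example, and it has not been checked against the Windows error dialog described in the question.

```ruby
require "timeout"

# Start cmd (an array: program followed by its arguments) and wait at most
# `seconds` for it to finish.
# Returns the exit status, or nil if the child had to be killed.
def run_with_timeout(cmd, seconds)
  pid = Process.spawn(*cmd)
  begin
    Timeout.timeout(seconds) do
      _, status = Process.wait2(pid)
      status.exitstatus
    end
  rescue Timeout::Error
    Process.kill("KILL", pid) # child took too long: stop it
    Process.wait(pid)         # reap it so no zombie is left behind
    nil
  end
end
```

A caller would treat nil as "the optional work did not finish" and carry on with the rest of the script.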

If you do not need to wait for the external program to finish before continuing, you can use exec instead of system; exec never returns.
You could always add a sleep $n afterwards to make it wait for the external program to theoretically finish.
exec('maybe_dies.exe');
sleep 1; # make sure it does stuff before it dies, or not, or whatever...

Check status of a forked process?

I'm running a process that will take, optimistically, several hours, and in the worst case, probably a couple of days.
I've tried a couple of times to run it and it just never seems to complete (I should add, I didn't write the program, it's just a big dataset). I know my syntax for the command is correct as I use it all the time for smaller data and it works properly (I'll spare you the details as it is obscure for SO and I don't think that relevant to the question).
Consequently, I'd like to leave the program unattended running as a fork with &.
Now, I'm not totally sure whether the process is just grinding to a halt or is running but taking much longer than expected.
Is there any way to check the progress of the process other than ps and top + 1 (to check CPU use)?
My only other thought was to get the process to output a logfile and periodically check to see if the logfile has grown in size/content.
As a sidebar, is it necessary to also use nohup with a forked command?

I would use screen for this purpose. See the man page for more details.
Brief summary of how to use it:
screen -S some_session_name - starts a new screen session named some_session_name
Ctrl + a, then d - detaches the session
screen -r some_session_name - returns you to your session

How to execute and manage ruby script from ruby?

I have a script named program.rb and would like to write a script named main.rb that would do the following:
system("ruby", "program.rb")
constantly check if program.rb is running until it is done
if program.rb has reached completion
exit main.rb
end
otherwise keep doing this until program.rb reaches completion{
if program.rb is not running and stopped before completing
restart program.rb from where it left off
end}
I've looked into Pidify but could not find a way to apply it to fit this exactly the right way...
Any help in how to approach this script would be greatly appreciated!
Update:
I could figure out how to resume running the script from where it left off in program.rb if there's no way to do it in main.rb

It's impossible to "restart the script from where it left off" without full cooperation from program.rb. That is, it should be able to advertise its progress (by writing its current state to a file, maybe?) and be able to start correctly from a step specified in ARGV. There's no external Ruby magic that can replace this functionality.
Also, if a program terminated abnormally, it means one of two things:
the error is (semi-)permanent (disk is full, no appropriate access rights to a file, etc). In this case, simply restarting the program would cause it to fail again. And again. Infinite fail loop.
the error is temporary (shaky internet connection). In this case, the program should do a better job with exception handling and retry on its own (instead of terminating).
In either case, there's no need for restarting, IMHO.

Well, here is one way.
Modify program.rb to take an optional flag argument --restart or something.
When program.rb starts up without this argument it will initialize a file to record its current state. Periodically, it will write whatever it needs into this file to record some kind of checkpoint.
When program.rb starts up with the restart flag, it will read its checkpoint file and start processing at that point. For this to work, it must either checkpoint all state changes or arrange for all processing between checkpoints to be idempotent so it can be repeated without ill effect.
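A minimal sketch of the checkpoint idea, assuming the work can be split into an ordered list of steps. The JSON file layout, the "next_step" key and the helper names are made up for illustration; program.rb would call run with restart set according to its --restart flag.

```ruby
require "json"

# Read the index of the next step to run; 0 if there is no checkpoint yet.
def load_checkpoint(path)
  return 0 unless File.exist?(path)
  JSON.parse(File.read(path)).fetch("next_step", 0)
end

# Write the checkpoint to a temp file first, then rename it into place,
# so a crash mid-write does not leave a half-written checkpoint.
def save_checkpoint(path, next_step)
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate({ "next_step" => next_step }))
  File.rename(tmp, path)
end

# steps: array of callables, each one unit of work between checkpoints.
# Without restart the checkpoint is reset; with restart, completed steps
# are skipped.
def run(steps, path:, restart:)
  start = restart ? load_checkpoint(path) : 0
  save_checkpoint(path, start)
  steps.each_with_index do |step, i|
    next if i < start
    step.call
    save_checkpoint(path, i + 1)
  end
end
```

A step that was interrupted halfway is run again from its beginning on restart, which is why each step needs to be idempotent.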
There are lots of ways to monitor the health of program.rb. The best way is with some sort of ping, perhaps something like GET /health_check or a dummy message via a socket or pipe. You could just have a locked file to detect if the lock is still held, or you could record the PID on startup and check that it still exists.
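The simplest of these checks, recording the PID and testing whether it still exists, might look like the sketch below. The helper names and the PID-file convention are assumptions for the example. Sending signal 0 with Process.kill delivers nothing; it only reports whether the process can be signalled. A PID can be reused by an unrelated process, so this is a weaker check than a ping.

```ruby
# True if a process with this PID currently exists.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0: existence check only, nothing is sent
  true
rescue Errno::ESRCH
  false                # no such process
rescue Errno::EPERM
  true                 # exists, but belongs to another user
end

# Called by program.rb on startup.
def write_pid_file(path)
  File.write(path, Process.pid.to_s)
end

# Called by main.rb to decide whether program.rb needs restarting.
def pid_file_alive?(path)
  return false unless File.exist?(path)
  process_alive?(Integer(File.read(path).strip))
end
```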

Getting previous exit code of an application on Windows

Is there any way to find out what the exit code of an application was the last time it ran?
I want to check whether the application exited with a non-zero exit code last time (which means abnormal termination in my case) and, if so, do some checking and maybe fix/clean up previously generated data.
Since some applications do this (they give a warning and ask if you want to run in Safe Mode this time) I think maybe Windows can tell me this.
And if not, what is the best practice for doing this? Setting a flag in a file or something when the application terminates correctly, and checking it the next time the application runs?

No, there's no permanent record of the exit code. It exists only as long as a handle to the process is kept open, and GetExitCodeProcess(), which returns it, needs that handle. As soon as the last handle is closed, the exit code is gone for good. One technique is a little bootstrapper app that starts the process and keeps the handle. It can then also do other handy things like send alerts, keep a log, clean up partial files or record minidumps of crashes. Use WaitForSingleObject() to detect the process exit.
Btw, you definitely want the exit code to mean the opposite: zero is always the "normal exit" value. This helps you detect hard crashes. When Windows terminates the app forcibly, the exit code is always non-zero; it is set to the exception code.
There are other ways: you can indeed create a file or registry key that indicates the process is running and check for it when the program starts back up. The only real complication is that you need to do something meaningful when the user starts the program twice. That is a hard problem to solve, so such apps are usually single-instance apps. You use a named mutex to detect that an instance of the program is already running. Imprinting the evidence with the process ID and start time is workable.
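The bootstrapper idea can be sketched in a few lines of Ruby. This is a portable analogue using Process.spawn and Process.wait2, not the Win32 calls named above, and the log format and helper names are made up for the example.

```ruby
# Start cmd, wait for it to exit, and append its exit code to log_path.
# Returns the exit code, or nil if the child was killed by a signal.
def launch_and_record(cmd, log_path)
  pid = Process.spawn(*cmd)
  _, status = Process.wait2(pid)
  code = status.exitstatus
  File.open(log_path, "a") do |f|
    f.puts("#{Time.now.to_i} #{code.nil? ? 'signal' : code}")
  end
  code
end

# Exit code of the most recent recorded run: an Integer, :signal,
# or nil if nothing has been recorded yet.
def last_exit_code(log_path)
  return nil unless File.exist?(log_path)
  last = File.readlines(log_path).last
  return nil if last.nil?
  field = last.split[1]
  field == "signal" ? :signal : Integer(field)
end
```

The application itself would then be started only through the launcher, and could ask last_exit_code on startup whether the previous run ended cleanly.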

There is no standard way to do this on the Windows Platform.
The easiest way to handle this case is to put a value on the registry and to clear it when the program exits.
If the value is still present when the program starts, then it terminated unexpectedly.
Put a value in the HKCU/Software// to be sure you have sufficient rights (the value will be per user in this case).
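The same "set a marker on start, clear it on clean exit" logic, sketched in Ruby with a marker file standing in for the registry value. The file-based storage and the helper names are assumptions for the example.

```ruby
# Call at program start. Returns true if the marker from a previous run is
# still present, i.e. that run did not reach finish_run.
def start_run(marker_path)
  previous_run_crashed = File.exist?(marker_path)
  # Record who set the marker and when, to help tell stale markers apart.
  File.write(marker_path, "#{Process.pid} #{Time.now.to_i}\n")
  previous_run_crashed
end

# Call on every clean exit path.
def finish_run(marker_path)
  File.delete(marker_path) if File.exist?(marker_path)
end
```

As with the registry approach, a second instance started while the first is still running would see the marker and wrongly conclude that a crash happened, so this only works as-is for single-instance programs.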

How to do pings in a rubyqt application so the GUI doesn't freeze?

I am writing an application, which shall work with networks.
As a GUI I am using rubyqt.
To determine if a Server is up I have to ping it (with net/ping).
But I ran into a problem. If the server is down,
the GUI freezes for the duration of the timeout, even if I put the code in a Thread or an IO.popen loop, e.g.
Thread.new('switch') do
  if Net::PingExternal.new("195.168.255.244",timeout=0.9).ping then
    down = false
  else
    down = true
  end
end
will freeze for 0.9 seconds. As QtThreads are not yet working with rubyqt,
does somebody have an idea how to keep the GUI from freezing (apart from reducing the timeout)?
I was thinking about putting the pinging-part in an external program, which writes the status (up/down) in a file, which the actual program then reads, but this solution seems to be a bit clumsy.

Have you considered abstracting that operation from the request altogether? If you move the costly operation to an external library you could easily queue it up and execute it using something like delayed_job (http://github.com/tobi/delayed_job/tree/master) which would remove the risk of it halting the request at all.
Maybe this is what you are looking for...?
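A lighter variant of "queue it up and run it elsewhere" is to run the check in a background thread and let the GUI poll for the result without blocking. The sketch below assumes a Ruby version whose threads keep running while another thread blocks on I/O (the question suggests this was not the case in the asker's setup). The class name is made up, and the block passed in stands for whatever ping call is used.

```ruby
# Runs a slow check in the background; poll never blocks the caller.
class BackgroundCheck
  # The block should return true if the server is up, false otherwise.
  def initialize(&check)
    @check = check
    @results = Queue.new
  end

  def start
    @thread = Thread.new { @results << @check.call }
    self
  end

  # Returns :pending until the check has finished, then :up or :down once.
  def poll
    @results.pop(true) ? :up : :down # non-blocking pop
  rescue ThreadError                 # raised when the queue is empty
    :pending
  end
end
```

In the GUI, poll would be called from a periodic timer callback, so the event loop only ever pays for a queue lookup.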
