bash: stop subshell script being marked as failed if one step exits with an error

I am running a script through the SLURM job scheduler on an HPC cluster.
I am invoking a subshell script from a master script.
The subshell script contains several steps. One step sometimes fails because of the quality of the data; it is not required for the steps that follow, but if it fails, my whole subshell script is marked with FAILED status in the job scheduler. However, I need this subshell script to have COMPLETED status in the job scheduler because it is a dependency in my master script.
I tried adding
set +e
in my subshell script right before the optional step, but it doesn't seem to work: I still get a non-zero exit code and FAILED status in the job scheduler.
In short: I need the subshell script to show COMPLETED status in the job scheduler regardless of whether that one particular step finishes with errors. I would appreciate help with this.

For Slurm jobs submitted with sbatch, the job exit code is taken to be the return code of the submission script itself. The return code of a Bash script is that of the last command in the script.
So if you just end your script with exit 0, Slurm should consider it COMPLETED no matter what.
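For example, a minimal sketch of such a submission script, assuming hypothetical step names in place of your real commands:
#!/bin/bash
#SBATCH --job-name=subscript           # hypothetical job name
set -e                                 # abort if a required step fails
required_step_one                      # placeholder for a real command
required_step_two                      # placeholder for a real command
# The optional step may fail; the || branch keeps the failure from
# aborting the script or becoming its exit status
optional_quality_step || echo "optional step failed, continuing" >&2
required_step_three                    # placeholder for a real command
# End with an explicit success code so Slurm reports COMPLETED
exit 0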

Related

In Travis, is it possible to mark a stage as "Cancelled" instead of "Failed" when running a bash script?

There does exist a "Cancelled" state, which you can invoke by clicking the small x next to the job in the Travis web UI.
Is it possible to enter this cancelled state from a bash script invoked by your .travis.yml? From the Travis docs:
If script returns a non-zero exit code, the build is failed
So returning a different error code doesn't help. Is it just not doable?

SLURM status string on job completion / exit

How do I get the slurm job status (e.g. COMPLETED, FAILED, TIMEOUT, ...) on job completion (within the submission script)?
That is, I want to separately keep track of jobs that timed out or failed.
Currently I work with the exit code; however, jobs that TIMEOUT also get exit code 0.
For future reference, here is how I finally do it.
At the beginning of the job, retrieve the job id and write some information (e.g. "${SLURM_JOB_ID} ${PWD}") to a summary file.
Then process this file and use something like sacct -X -n -o State -j ${jid} to get the job status, as in the sketch below.
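A minimal sketch of that workflow, assuming a hypothetical summary file named jobs_summary.txt:
# Inside each submission script: record the job id and working directory
echo "${SLURM_JOB_ID} ${PWD}" >> jobs_summary.txt

# Later, query sacct for the final state of every recorded job
while read -r jid dir; do
    state=$(sacct -X -n -o State -j "${jid}" | tr -d ' ')
    printf '%s %s %s\n' "${jid}" "${dir}" "${state}"   # e.g. COMPLETED, FAILED, TIMEOUT
done < jobs_summary.txt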

LSF - BSUB Running a script if the job is killed

I'm working with LSF, running bsub commands.
I'm using the -Ep switch to run a post-exec script. This works great until the job is killed or hits a memory limit, run limit, etc.
Is there any way for the job to detect that it is running out of resources and then run the script, or to force it to run the script even if it has been killed?
I guess my other option is running a second job with a dependency on that job, which will run the "post exec" script when it finishes.
Any thoughts?
Kind Regards,
TheBigPeeler
From the documentation, you should be seeing the behaviour that you want.
A post-execution command runs after the job finishes, regardless of the exit state of the job. Once a post-execution command is associated with a job, that command runs even if the job fails. You cannot configure the post-execution command to run only under certain conditions.
I thought that maybe the interaction with JOB_INCLUDE_POSTEXEC (lsb.params) could account for the difference, but from my test the post-exec still runs in both cases. I used runlimit (bsub -W) to trigger the job kill.
Is it possible that the post exec is running, but exits early?
What version of LSF are you using? (What's the output of mbatchd -V and sbatchd -V)
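A minimal reproduction along those lines might look like this (the script paths are hypothetical); the one-minute run limit forces the kill so you can see whether the post-exec still runs afterwards:
# Submit with a 1-minute run limit and a post-execution command
bsub -W 1 -Ep "/path/to/post_exec.sh" "/path/to/long_running_job.sh"
# After the job is killed, check bjobs/bhist output and any marker
# file that post_exec.sh writes to confirm whether it actually ran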

How can I check the exit codes of individual processes run in parallel by GNU Parallel

I have an array in a bash shell script. The array contains a list of commands.
For instance:
args=( "ls" "mv /abc/file1 /xyz/file2" "hive -e 'select * from something'" )
Now I am executing the commands in the array using GNU parallel as below:
parallel ::: "${args[@]}"
I want to check the status code of each individual process when it finishes. I am aware that $? will give me the number of processes that failed, but I want to know the exit code of each individual process. How can I catch the exit codes of individual processes executed by GNU parallel?
Use the --halt 1 option, which makes parallel quit on the failing command while returning its exit code. From man parallel:
--halt-on-error val
--halt val
    How should GNU parallel terminate if one or more jobs fail?

    0      Do not halt if a job fails. Exit status will be the number of jobs failed. This is the default.
    1      Do not start new jobs if a job fails, but complete the running jobs including cleanup. The exit status will be the exit status from the last failing job.
    2      Kill off all jobs immediately and exit without cleanup. The exit status will be the exit status from the failing job.
    1-99%  If val% of the jobs fail and minimum 3: Do not start new jobs, but complete the running jobs including cleanup. The exit status will be the exit status from the last failing job.

--joblog logfile
    Logfile for executed jobs. Save a list of the executed jobs to logfile in the following TAB separated format: sequence number, sshlogin, start time as seconds since epoch, run time in seconds, bytes in files transferred, bytes in files returned, exit status, signal, and command run.
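If you need the individual exit codes rather than a single halting status, --joblog is the more direct route. A minimal sketch, assuming a hypothetical log path /tmp/parallel.log:
args=( "ls" "mv /abc/file1 /xyz/file2" "hive -e 'select * from something'" )
parallel --joblog /tmp/parallel.log ::: "${args[@]}"

# Column 7 of the TAB-separated log is the exit status, column 9 the command
awk -F'\t' 'NR > 1 { print $7, $9 }' /tmp/parallel.log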

Autosys job not failing when the shell script fails

I am moving existing manual shell scripts to run via Autosys jobs. However, after adding exit 1 for each failure case, the Autosys job does not fail and Autosys shows the exit code as 0.
I tried the simple script below:
#!/bin/ksh
exit 1;
When I execute this, the Autosys job shows a success status. I have not updated the success code or max success code in Autosys; everything is default. What am I missing?
