How can I check the exit code of individual processes running in parallel executed by GNU Parallel - bash

I have an array in a Linux shell script. The array contains a list of commands in a bash shell script.
For instance:
args=( "ls" "mv /abc/file1 /xyz/file2" "hive -e 'select * from something'" )
Now I am executing these commands in the array using GNU parallel as below:
parallel ::: "${args[@]}"
I want to check the status code of the individual processes when they finish. I am aware that $? will give me the number of processes that have failed, but I want to know the exit code of each individual process. How can I catch the exit codes of the individual processes executed by GNU parallel?

Use the --halt 1 option, which makes parallel quit on the halting command while returning its exit code. From man parallel:
--halt-on-error val
--halt val
How should GNU parallel terminate if one or more jobs fail?
0      Do not halt if a job fails. Exit status will be the number of jobs failed. This is the default.
1      Do not start new jobs if a job fails, but complete the running jobs including cleanup. The exit status will be the exit status from the last failing job.
2      Kill off all jobs immediately and exit without cleanup. The exit status will be the exit status from the failing job.
1-99%  If val% of the jobs fail and minimum 3: Do not start new jobs, but complete the running jobs including cleanup. The exit status will be the exit status from the last failing job.

--joblog logfile
Logfile for executed jobs. Save a list of the executed jobs to logfile in the following TAB separated format: sequence number, sshlogin, start time as seconds since epoch, run time in seconds, bytes in files transferred, bytes in files returned, exit status, signal, and command run.
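For the original question above, --joblog is the most direct way to capture every command's individual exit code: each finished job gets one row in the log, and the seventh TAB-separated column is its exit status. A minimal sketch, assuming /tmp/joblog.tsv as an example log path:
args=( "ls" "mv /abc/file1 /xyz/file2" "hive -e 'select * from something'" )
parallel --joblog /tmp/joblog.tsv ::: "${args[@]}"
# Skip the header line; column 7 is Exitval, column 9 is the command that ran.
awk -F'\t' 'NR > 1 { print "exit=" $7 "\tcmd=" $9 }' /tmp/joblog.tsv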

Related

bash: stop subshell script marked as failed if one step exits with an error

I am running a script through the SLURM job scheduler on HPC.
I am invoking a subshell script through a master script.
The subshell script contains several steps. One step sometimes fails because of the quality of the data; this step is not required for further steps, but if it fails, my whole subshell script is marked with FAILED status in the job scheduler. However, I need this subshell script to have a COMPLETED status in the job scheduler, as it is a dependency in my master script.
I tried setting up
set +e
in my subshell script right before the optional step, but it doesn't seem to work: I still get an exit code with errors and FAILED status in the job scheduler.
In short: I need the subshell script to have Status "completed" in the job scheduler, no matter whether one particular step is finished with errors or not. Will appreciate help with this.
For Slurm jobs submitted with sbatch, the job exit code is taken to be the return code of the submission script itself. The return code of a Bash script is that of the last command in the script.
So if you just end your script with exit 0, Slurm should consider it COMPLETED no matter what.
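A minimal sketch of such a submission script (the step names are placeholders, not from the question): the failure of the optional step is swallowed with || true, and the script ends with exit 0 so sbatch reports COMPLETED.
#!/bin/bash
#SBATCH --job-name=pipeline-step
./required_step.sh                  # hypothetical required step
./optional_quality_step.sh || true  # hypothetical optional step; its failure is ignored
exit 0                              # last command returns 0, so Slurm marks the job COMPLETED
Note that an unconditional exit 0 also hides failures of the required steps, so check those explicitly if their status matters.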

Obtain the exit code for a known process id

I have a list of processes triggered one after the other, in parallel. And, I need to know the exit code of all of these processes when they complete execution, without waiting for all of the processes to finish.
While status=$?; echo $status would provide the exit code for the last command executed, how do I know the exit code of any completed process, knowing the process id?
You can do that with GNU Parallel like this:
parallel --halt=now,done=1 ::: ./job1 ./job2 ./job3
The --halt=now,done=1 means halt immediately, as soon as any one job is done, killing all outstanding jobs immediately and exiting itself with the exit status of the completed job.
There are options to exit on success or on failure, as well as on completion. The number of successful, failing or complete jobs can also be given as a percentage. See the --halt section of the GNU Parallel documentation.
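For instance, a hedged illustration of the percentage form (the job names are placeholders): stop launching new jobs once 20% of them have failed, but let the already-running ones finish.
parallel --halt soon,fail=20% ::: ./job1 ./job2 ./job3 ./job4 ./job5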
Save the background job id using a wrapper shell function. After that the exit status of each job can be queried:
#!/bin/bash
jobs=()
function run_child() {
    "$@" &          # run the given command in the background
    jobs+=($!)      # remember its PID
}
run_child sleep 1
run_child sleep 2
run_child false
for job in "${jobs[@]}"; do
    wait "$job"
    echo "Exit Code $?"
done
Output:
Exit Code 0
Exit Code 0
Exit Code 1

SLURM status string on job completion / exit

How do I get the slurm job status (e.g. COMPLETED, FAILED, TIMEOUT, ...) on job completion (within the submission script)?
I.e. I want to write to separately keep track of jobs which are timed out / failed.
Currently I work with the exit code, however jobs which TIMEOUT also get exit code 0.
For future reference, here is how I finally do it.
At the beginning of the job I retrieve the jobid and write some information (e.g. "${SLURM_JOB_ID} ${PWD}") to a summary file.
Then I process this file and use something like sacct -X -n -o State -j ${jid} to get the job status.
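A sketch of that workflow, assuming a summary file at ~/job_summary.txt (the path is only an example):
# Inside each submission script: record the job id and working directory.
echo "${SLURM_JOB_ID} ${PWD}" >> ~/job_summary.txt
# Later, from a reporting script: ask sacct for the final state of every recorded job.
while read -r jid dir; do
    state=$(sacct -X -n -o State -j "$jid" | head -1 | tr -d ' ')
    echo "$jid $dir $state"   # e.g. COMPLETED, FAILED, TIMEOUT
done < ~/job_summary.txt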

Bash files: run process in parallel and stop when one is over

I would like to start two C programs from a bash script in parallel, and have the second one stop when the first one has finished.
The instruction wait expects both processes to stop which is not what I would like to do.
Thanks for any suggestion.
GNU parallel can do this kind of job. Check the termination section; it can shut down the remaining processes based on the exit code (either success or failure):
parallel -j2 --halt now,success=1 ::: 'cmd1 args' 'cmd2 args'
When one of the jobs finishes successfully, it will send a TERM signal to the other jobs (if the jobs do not terminate, it forces them to with a KILL signal).
With $! you get the PID of the last command started in the background. See some nice examples here: Bash `wait` command, waiting for more than 1 PID to finish execution
For your particular problem I imagine something like:
#!/bin/bash

command_master() {
    echo "Command_master"
    sleep 1
}

command_tokill() {
    echo "Command_tokill"
    sleep 10
}

command_master & pid_master=$!
command_tokill & pid_tokill=$!

wait "$pid_master"
kill "$pid_tokill"
wait -n is what you are looking for. It waits for the next job to finish. You can then have a list of the PIDs of the remaining jobs with jobs -p if you want to kill them.
prog1 & pids=( $! )
prog2 & pids+=( $! )
wait -n
kill "${pids[#]}"
This requires bash.
The two programs are started as background jobs, and the shell waits for one of them to exit.
When this happens, kill is used to terminate both processes (this will cause an error since one of them is already dead).

Listen to background process's exit code in MakeFile

Solved
I need to spawn background processes in a Makefile and also take their exit codes into account.
Scenario:
Several processes are spawned in the background.
The Makefile continues evaluation (and I do not want to check the spawned processes' PIDs in some loop and so forth).
Some process exits with a non-zero exit code.
The make utility exits with a non-zero exit code.
Naturally, I am considering using command & to spawn a process in the background.
Problem: if a command is specified like command &, then the make process does not track its exit code.
Sample 1
do:
	@false & \
	echo "all is normal"
%make -f exit_status_test.mk
all is normal
Sample 2
do:
	@false && \
	echo "all is normal"
%make -f exit_status_test.mk
*** Error code 1
Stop in /usr/home/scher/tmp/lock_testing.
Sample 1 shows that the make utility does not consider the exit code of the background process.
P.S. Please do not advise storing the spawned processes' PIDs and checking them in a loop with some sleep delay and so forth. I need the Makefile to continue evaluation and exit with a non-zero code automatically.
Solution
do:
	@(echo "background command" ; (echo "[HANDLER] Prev command exits with $$?")) & \
	echo "doing something"
This way we can create a sequence of commands to handle the exit status of a background process.
This seems like an ill-conceived attempt to create a Makefile that can run multiple jobs in parallel, when in fact make can generally do this for you.
All you need to do is give each job a separate command in make:
target: job1 job2

job1:
	some_command

job2:
	some_other_command
If you use something like this in your Makefile and then run make -j2 target, then both some_command and some_other_command will be run in parallel.
See if you can find a way to get make to run your work in parallel like this.
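A hedged usage note: with targets laid out like that, make's own exit status reflects the recipes it ran, which is what the original Makefile question asked for.
make -j2 target              # runs some_command and some_other_command in parallel
echo "make exited with $?"   # non-zero if either recipe failed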
