Start independent job steps and keep track of highest exit code - bash

I want to start many independent tasks (job steps) as part of one job and want to keep track of the highest exit code of all these tasks.
Inspired by this question I am currently doing something like
#SBATCH stuf....
for i in {1..3}; do
srun -n 1 ./myprog ${i} >& task${i}.log &
done
wait
in my jobs.sh, which I sbatch, to start my tasks.
How can I define a variable exitcode which, after the wait command, contains the highest exit code of all the tasks?
Thanks so much in advance!

You can store jobs' pids in an array and wait for each one, like this
#SBATCH stuf....
for i in {1..3}; do
srun -n 1 ./myprog ${i} >& task${i}.log &
pids+=($!)
done
for pid in "${pids[@]}"; do
wait $pid
exitcode=$(( $? > exitcode ? $? : exitcode ))
done
echo $exitcode
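If the bash on your cluster is 4.3 or newer, a variant without the pid array is to call wait -n once per task and keep the largest status as the tasks finish; a minimal sketch along the lines of your loop (untested inside a real Slurm job):
exitcode=0
for i in {1..3}; do
srun -n 1 ./myprog ${i} >& task${i}.log &
done
for i in {1..3}; do
wait -n                                   # exit status of whichever task finished
status=$?
(( status > exitcode )) && exitcode=$status
done
echo $exitcode
exit $exitcode                            # optional: make the whole job report the worst status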

You can use GNU parallel to your advantage in such case:
#SBATCH stuf....
parallel --joblog ./jobs.log -P 3 "srun -n1 --exclusive ./myprog {} >& task{}.log " ::: {1..3}
This will run srun ./myprog three times with arguments 1, 2 and 3 respectively, and redirect the output to three files named task1.log, task2.log and task3.log, just like your for-loop does.
With the --joblog option, it will furthermore create a file jobs.log that will contain some information about each run, among which is the exit code, in column 7. You can then extract the maximum with
awk 'NR>1 {print $7}' jobs.log | sort -n | tail -1
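If you want that maximum back in a shell variable, for instance to make the job itself fail, a short sketch could be:
exitcode=$(awk 'NR>1 {print $7}' jobs.log | sort -n | tail -1)
echo "highest exit code: $exitcode"
exit $exitcode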

Related

SLURM sbatch script not running all srun commands in while loop

I'm trying to submit multiple jobs in parallel as a preprocessing step in sbatch using srun. The loop reads a file containing 40 file names and uses "srun command" on each file. However, not all files are being sent off with srun and the rest of the sbatch script continues after the ones that did get submitted finish. The real sbatch script is more complicated and I can't use arrays with this so that won't work. This part should be pretty straightforward though.
I made this simple test case as a sanity check and it does the same thing. For every file name in the file list (40) it creates a new file containing 'foo' in it. Every time I submit the script with sbatch it results in a different number of files being sent off with srun.
#!/bin/sh
#SBATCH --job-name=loop
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G
#SBATCH -A zheng_lab
#SBATCH -p exacloud
#SBATCH --error=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/log_files/test.%J.err
#SBATCH --output=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/log_files/test.%J.out
DIR=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel
SAMPLES=$DIR/samples.txt
OUT_DIR=$DIR/test_out
FOO_FILE=$DIR/foo.txt
# Create output directory
srun -N 1 -n 1 -c 1 mkdir $OUT_DIR
# How many files to run
num_files=$(srun -N 1 -n 1 -c 1 wc -l $SAMPLES)
echo "Number of input files: " $num_files
# Create a new file for every file in listing (run 5 at a time, 1 for each node)
while read F ;
do
fn="$(rev <<< "$F" | cut -d'/' -f 1 | rev)" # Remove path for writing output to new directory
echo $fn
srun -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &
done <$SAMPLES
wait
# How many files actually got created
finished=$(srun -N 1 -n 1 -c 1 ls -lh $OUT_DIR/*out | wc -l)
echo "Number of files submitted: " $finished
Here is my output log file the last time I tried to run it:
Number of input files: 40 /home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/samples.txt
sample1
sample2
sample3
sample4
sample5
sample6
sample7
sample8
Number of files submitted: 8
The issue is that srun redirects its stdin to the tasks it starts, and therefore the contents of $SAMPLES are consumed, in an unpredictable way, by all the cat commands that are started rather than by the while read loop.
Try with
srun --input none -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &
The --input none parameter tells srun not to touch stdin, leaving it for the while read loop.
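If you would rather not modify every srun line, another option (a sketch, not tested on your cluster) is to feed the loop from a separate file descriptor, so that only the loop reads from $SAMPLES:
while read -u 3 -r F ;
do
fn="$(rev <<< "$F" | cut -d'/' -f 1 | rev)" # Remove path for writing output to new directory
echo $fn
srun -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &
done 3<$SAMPLES
wait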

Is there a way to stop scripts that are running simultaneously if one of them send an echo?

I need to find if a value (actually it's more complex than that) is in one of 20 servers I have. And I need to do it as fast as possible. Right now I am sending the scripts simultaneously to all the servers. My main script is something like this (but with all the servers):
#!/bin/sh
#mainScript.sh
value=$1
c1=`cat serverList | sed -n '1p'`
c2=`cat serverList | sed -n '2p'`
sh find.sh $value $c1 & sh find.sh $value $c2
#!/bin/sh
#find.sh
#some code here .....
if [ $? -eq 0 ]; then
rm $tempfile
else
myValue=`sed -n '/VALUE/p' $tempfile | awk 'BEGIN{FS="="} {print substr($2, 8, length($2)-2)}'`
echo "$myValue"
fi
So the script only returns a response if it finds the value on the server. I would like to know if there is a way to stop executing the other scripts once one of them has returned a value.
I tried adding an "exit" on the find.sh script but it won't stop all the scripts. Can somebody please tell me if what I want to do is possible?
I would suggest that you use something that can handle this for you: GNU Parallel. From the linked tutorial:
If you are looking for success instead of failures, you can use success. This will finish as soon as the first job succeeds:
parallel -j2 --halt now,success=1 echo {}\; exit {} ::: 1 2 3 0 4 5 6
Output:
1
2
3
0
parallel: This job succeeded:
echo 0; exit 0
I suggest you start by modifying your find.sh so that its return code reflects its success; that will let us identify a successful call more easily. For instance:
myValue=`sed -n '/VALUE/p' $tempfile | awk 'BEGIN{FS="="} {print substr($2, 8, length($2)-2)}'`
[ -n "$myValue" ]   # succeed only if a value was actually extracted
success=$?
echo "$myValue"
exit $success
To terminate all the find.sh processes spawned by your script you can use pkill with a parent-process-ID criterion and a command-name criterion:
pkill -P $$ find.sh # $$ refers to the current process' PID
Note that this requires that you start the find.sh script directly rather than passing it as a parameter to sh. Normally that shouldn't be a problem, but if you have a good reason to call sh rather than your script, you can replace find.sh in the pkill command by sh (assuming you're not spawning other scripts you wouldn't want to kill).
Now that find.sh exits with success only when it finds the expected string, you can chain the two actions with && and run the whole thing in the background:
{ find.sh $value $c1 && pkill -P $$ find.sh; } &
The first occurrence of find.sh that terminates with success will invoke the pkill command that will terminate all others (those killed processes will have non-zero exit codes and therefore won't run their associated pkill).
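Applied to your servers, the GNU Parallel route from the first half of this answer could look roughly like this (a sketch, assuming find.sh is modified to exit non-zero when it does not find the value, as suggested above):
parallel -j0 --halt now,success=1 sh find.sh "$value" {} :::: serverList
Here :::: serverList reads one server name per line from the file, and -j0 runs as many jobs at once as possible.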

running each element in array in parallel in bash script

Let's say I have a bash script that looks like this:
array=( 1 2 3 4 5 6 )
for each in "${array[@]}"
do
echo "$each"
command --arg1 $each
done
If I want to run everything in the loop in parallel, I could just change command --arg1 $each to command --arg1 $each &.
But now let's say I want to take the results of command --arg1 $each and do something with those results like this:
array=( 1 2 3 4 5 6 )
for each in "${array[@]}"
do
echo "$each"
lags=($(command --arg1 $each))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
done
If I just add a & to the end of command --arg1 $each, everything after command --arg1 $each will run without command --arg1 $each finishing first. How do I prevent that from happening? Also, how do I limit the number of threads the loop can occupy?
Essentially, this block should run in parallel for 1,2,3,4,5,6
echo "$each"
lags=($(command --arg1 $each))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
-----EDIT--------
Here is the original code:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka.client.jaas.conf"
IFS=$'\n'
array=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --list --command-config /etc/kafka/client.properties --new-consumer))
lngth=${#array[*]}
echo "array length: " $lngth
timestamp=$(($(date +%s%N)/1000000))
log_time=`date +%Y-%m-%d:%H`
echo "log time: " $log_time
log_file="/home/ec2-user/laglogs/laglog.$log_time.log"
echo "log file: " $log_file
echo "timestamp: " $timestamp
get_lags () {
echo "$1"
lags=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --describe --group $1 --command-config /etc/kafka/client.properties --new-consumer))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
}
for each in "${array[@]}"
do
get_lags $each &
done
------EDIT 2-----------
Trying with answer below:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka.client.jaas.conf"
IFS=$'\n'
array=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --list --command-config /etc/kafka/client.properties --new-consumer))
lngth=${#array[*]}
echo "array length: " $lngth
timestamp=$(($(date +%s%N)/1000000))
log_time=`date +%Y-%m-%d:%H`
echo "log time: " $log_time
log_file="/home/ec2-user/laglogs/laglog.$log_time.log"
echo "log file: " $log_file
echo "timestamp: " $timestamp
max_proc_count=8
run_for_each() {
local each=$1
echo "Processing: $each" >&2
IFS=$'\n' read -r -d '' -a lags < <(kafka-consumer-groups --bootstrap-server kafka1:9092 --describe --command-config /etc/kafka/client.properties --new-consumer --group "$each" && printf '\0')
for result in "${lags[@]}"; do
printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
done >>"$log_file"
}
export -f run_for_each
export log_file # make log_file visible to subprocesses
printf '%s\0' "${array[@]}" |
xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _
The convenient thing to do is to push your background code into a separate script -- or an exported function. That way xargs can create a new shell, and access the function from its parent. (Be sure to export any other variables that need to be available in the child as well).
array=( 1 2 3 4 5 6 )
max_proc_count=8
log_file=out.txt
run_for_each() {
local each=$1
echo "Processing: $each" >&2
IFS=$' \t\n' read -r -d '' -a lags < <(yourcommand --arg1 "$each" && printf '\0')
for result in "${lags[@]}"; do
printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
done >>"$log_file"
}
export -f run_for_each
export log_file # make log_file visible to subprocesses
printf '%s\0' "${array[@]}" |
xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _
Some notes:
Using echo -e is bad form. See the APPLICATION USAGE and RATIONALE sections in the POSIX spec for echo, explicitly advising using printf instead (and not defining an -e option, and explicitly defining that echo must not accept any options other than -n).
We're including the each value in the log file so it can be extracted from there later.
You haven't specified whether the output of yourcommand is space-delimited, tab-delimited, line-delimited, or otherwise. I'm thus accepting all these for now; modify the value of IFS passed to the read to taste.
printf '%(...)T' to get a timestamp without external tools such as date requires bash 4.2 or newer. Replace with your own code if you see fit; a date-based fallback is sketched after these notes.
read -r -a arrayname < <(...) is much more robust than arrayname=( $(...) ). In particular, it avoids treating emitted values as globs -- replacing *s with a list of files in the current directory, or Foo[Bar] with FooB should any file by that name exist (or, if the failglob or nullglob options are set, triggering a failure or emitting no value at all in that case).
Redirecting stdout to your log_file once for the entire loop is somewhat more efficient than redirecting it every time you want to run printf once. Note that having multiple processes writing to the same file at the same time is only safe if all of them opened it with O_APPEND (which >> will do), and if they're writing in chunks small enough to individually complete as single syscalls (which is probably happening unless the individual lags values are quite large).
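For the bash 4.2 note above, a possible fallback that shells out to date once per record (same names as in the sketch above) might be:
timestamp=$(date +%Y-%m-%dT%H:%M:%S)       # one external process per record
printf '%s\t%s\t%s\n' "$timestamp" "$each" "$result"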
A lot of lengthy and theoretical answers here; I'll try to keep it simple. What about using | (pipe) to connect the commands as usual? ;) (And GNU parallel, which excels at this type of task.)
seq 6 | parallel -j4 "command --arg1 {} | command2 > results/{}"
The -j4 will limit the number of threads (jobs) as requested. You DON'T want to write to a single file from multiple jobs; output one file per job and join them after the parallel processing is finished, as shown below.
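For example, once the run above has finished, the per-job outputs could be joined with:
cat results/{1..6} > combined.log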
Using GNU Parallel it looks like this:
array=( 1 2 3 4 5 6 )
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > output
GNU Parallel makes sure output from different jobs is not mixed.
If you prefer the output from jobs mixed:
parallel -0 --bar --line-buffer --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > output-linebuffer
Again GNU Parallel makes sure to only mix with full lines: You will not see half a line from one job and half a line from another job.
It also works if the array is a bit more nasty:
array=( "new
line" 'quotes" '"'" 'echo `do not execute me`')
Or if the command prints long lines a half-line at a time:
command() {
echo Input: "$#"
echo '" '"'"
sleep 1
echo -n 'Half a line '
sleep 1
echo other half
superlong_a=$(perl -e 'print "a"x1000000')
superlong_b=$(perl -e 'print "b"x1000000')
echo -n $superlong_a
sleep 1
echo $superlong_b
}
export -f command
GNU Parallel strives to be a general solution. This is because I have designed GNU Parallel to care about correctness and try vehemently to deal correctly with corner cases, too, while staying reasonably fast.
GNU Parallel guards against race conditions and does not split a job's output across lines or mix it with the output of other jobs.
array=( $(seq 30) )
max_proc_count=30
command() {
# If 'a', 'b' and 'c' mix: Very bad
perl -e 'print "a"x3000_000," "'
perl -e 'print "b"x3000_000," "'
perl -e 'print "c"x3000_000," "'
echo
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
# 'abc' should always stay together
# and there should only be a single line per job
cat parallel.out | tr -s abc
GNU Parallel works fine if the output has a lot of words:
array=(1)
command() {
yes "`seq 1000`" | head -c 10M
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
GNU Parallel does not eat all your memory - even if the output is bigger than your RAM:
array=(1)
outputsize=1000M
export outputsize
command() {
yes "`perl -e 'print \"c\"x30_000'`" | head -c $outputsize
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
You know how to execute commands in separate processes. The missing part is how to allow those processes to communicate, as separate processes cannot share variables.
Basically, you must choose whether to communicate using regular files, or inter-process communication/FIFOs (which still boils down to using files).
The general approach :
Decide how you want to present tasks to be executed. You could have them as separate files on the filesystem, as a FIFO special file that can be read from, etc. This could be as simple as writing each command to be executed to a separate file, or writing each command to a FIFO (one command per line).
In the main process, prepare the files describing tasks to perform or launch a separate process in the background that will feed the FIFO.
Then, still in the main process, launch worker processes in the background (with &), as many of them as you want parallel tasks being executed (not one per task to perform). Once they have been launched, use wait to, well, wait until all processes are finished. Separate processes cannot share variables, you will have to write any output that needs to be used later to separate files, or a FIFO, etc. If using a FIFO, remember more than one process can write to a FIFO at the same time, so use some kind of mutex mechanism (I suggest looking into the use of mkdir/rmdir for that purpose).
Each worker process must fetch the next task (from a file/FIFO), execute it, generate the output (to a file/FIFO), loop until there are no new tasks, then exit. If using files, you will need to use a mutex to "reserve" a file, read it, and then delete it to mark it as taken care of. This would not be needed for a FIFO.
Depending on the case, your main process may have to wait until all tasks are finished before handling the output, or in some cases may launch a worker process that will detect and handle output as it appears. This worker process would have to either be stopped by the main process once all tasks have been executed, or figure out for itself when all tasks have been executed and exit (while being waited on by the main process).
This is not detailed code, but I hope it gives you an idea of how to approach problems like this.
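A minimal sketch of the "one task file per command, mkdir as mutex" variant described above; the paths, the pool size and the command --arg1 placeholder are made up for illustration:
#!/bin/bash
mkdir -p tasks locks results
for i in {1..20}; do
    echo "command --arg1 $i" > "tasks/task$i"        # one task description per file
done

worker() {
    local f name
    for f in tasks/task*; do
        name=$(basename "$f")
        mkdir "locks/$name" 2>/dev/null || continue  # mkdir is atomic: only one worker claims each task
        bash "$f" > "results/$name.out"              # run the task, one output file per task
    done
}

for ((w=0; w<4; w++)); do
    worker &                                         # fixed pool of 4 worker processes
done
wait                                                 # main process waits for all workers
cat results/*.out >> "$log_file"                     # handle the collected output afterwards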
(Community Wiki answer with the OP's proposed self-answer from the question -- now edited out):
So here is one way I can think of doing this; I'm not sure if it is the most efficient way, and I also can't control the number of threads (or processes?) it would use:
array=( 1 2 3 4 5 6 )
lag_func () {
echo "$1"
lags=($(command --arg1 $1))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
}
for each in "${array[@]}"
do
lag_func $each &
done

xargs parallel - capturing exit code

I have a shell script that parses a flatfile and for each line in it, executes a hive script in parallel.
xargs -P 5 -d $'\n' -n 1 bash -c '
IFS='\t' read -r arg1 arg2 arg3 <<< "$1"
eval "hive -hiveconf tableName=$arg1 -f ../hive/LoadTables.hql" 2> ../path/LogFile-$arg1
' _ < ../path/TableNames.txt
The question is how I can capture the exit codes from each parallel process, so that even if one child process fails, the script exits at the end with an error code.
Unfortunately I can't use gnu parallel.
I suppose you are looking for something fancier, but a simple solution is to store possible errors in a tmp file and check it afterwards:
FilewithErrors=/tmp/errors.txt
FinalError=0
xargs -P 5 -d $'\n' -n 1 bash -c '
IFS='\t' read -r arg1 arg2 arg3 <<< "$1"
eval "hive -hiveconf tableName=$arg1 -f ../hive/LoadTables.hql || echo $arg1 > $FilewithErrors" 2> ../path/LogFile-$arg1
' _ < ../path/TableNames.txt
if [ -e $FilewithErrors ]; then FinalError=1; fi
rm $FilewithErrors
exit $FinalError
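One caveat with the sketch above: $FilewithErrors is referenced inside the single-quoted bash -c script, so the child shells will not see it unless it is exported. A possible variant that exports it and lets cut do the tab splitting (same paths as in the question, untested against a real Hive setup):
export FilewithErrors=/tmp/errors.txt      # exported so the bash -c child shells can see it
rm -f "$FilewithErrors"
xargs -P 5 -d $'\n' -n 1 bash -c '
arg1=$(cut -f1 <<< "$1")                   # first tab-separated field of the line
hive -hiveconf tableName="$arg1" -f ../hive/LoadTables.hql 2> "../path/LogFile-$arg1" \
  || echo "$arg1" >> "$FilewithErrors"
' _ < ../path/TableNames.txt
if [ -e "$FilewithErrors" ]; then exit 1; fi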
As per the comments: Use GNU Parallel installed as a personal or minimal installation as described in http://git.savannah.gnu.org/cgit/parallel.git/tree/README
From man parallel
EXIT STATUS
Exit status depends on --halt-on-error if one of these are used: success=X, success=Y%, fail=Y%.
0     All jobs ran without error. If success=X is used: X jobs ran without error. If success=Y% is used: Y% of the jobs ran without error.
1-100 Some of the jobs failed. The exit status gives the number of failed jobs. If Y% is used the exit status is the percentage of jobs that failed.
101   More than 100 jobs failed.
255   Other error.
If you need the exact error code (and not just whether the job failed or not) use: --joblog mylog.
You can probably do something like:
cat ../path/TableNames.txt |
parallel --colsep '\t' --halt now,fail=1 hive -hiveconf tableName={1} -f ../hive/LoadTables.hql '2>' ../path/LogFile-{1}
fail=1 will stop spawning new jobs if one job fails, and exit with the exit code from the job.
now will kill the remaining jobs immediately. If you want the remaining jobs to finish on their own ("die of natural causes"), use soon instead.
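If you do need the exact per-job exit codes, the --joblog mentioned above can be combined with the column-7 trick shown earlier on this page; a sketch:
cat ../path/TableNames.txt |
parallel --colsep '\t' --joblog jobs.log hive -hiveconf tableName={1} -f ../hive/LoadTables.hql '2>' ../path/LogFile-{1}
exitcode=$(awk 'NR>1 {print $7}' jobs.log | sort -n | tail -1)   # column 7 of the joblog is the exit value
exit $exitcode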

How to run a fixed number of processes in a loop?

I have a script like this:
#!/bin/bash
for i=1 to 200000
do
create input file
run ./java
done
I need to run a number (8 or 16) of processes (java) at the same time and I don't know how. I know that wait could help, but I want 8 processes running at all times, not waiting for the first 8 to finish before starting the next 8.
bash 4.3 added a useful new flag to the wait command, -n, which causes wait to block until any single background job completes, rather than waiting for a specific job (or for all of them).
#!/bin/bash
cores=8 # or 16, or whatever
for ((i=1; i <= 200000; i++))
do
# create input file and run java in the background.
./java &
# Check how many background jobs there are, and if it
# is equal to the number of cores, wait for anyone to
# finish before continuing.
background=( $(jobs -p) )
if (( ${#background[@]} == cores )); then
wait -n
fi
done
There is a small race condition: if you are at maximum load but a job completes after you run jobs -p, you'll still block until another job
completes. There's not much you can do about this, but it shouldn't present too much trouble in practice.
Prior to bash 4.3, you would need to poll the set of background jobs periodically to see when the pool dropped below your threshold.
while :; do
background=( $(jobs -p))
if (( ${#background[@]} < cores )); then
break
fi
sleep 1
done
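Put back into the loop, the pre-4.3 version might look like this (same placeholders as above):
cores=8
for ((i=1; i <= 200000; i++))
do
# create input file and run java in the background.
./java &
# Poll until the pool drops below the limit.
while :; do
background=( $(jobs -p) )
if (( ${#background[@]} < cores )); then
break
fi
sleep 1
done
done
wait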
Use GNU Parallel like this, simplified to 20 jobs rather than 200,000, with echo standing in for "create file" and sleep standing in for "java".
seq 1 20 | parallel -j 8 -k 'echo {}; sleep 2'
The -j 8 says how many jobs to run at once. The -k says to keep the output in order.
(A little animation of the output, showing the timing/sequence, accompanies the original answer.)
With a non-ancient version of GNU utilities or on *BSD/OSX, use xargs with the -P option to run processes in parallel.
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 mytask
where mytask is an auxiliary script, with the sequence number (the input line) available as the argument $1:
#!/bin/bash
echo "Task number $1"
create input file
run ./java
You can put everything in one script if you want:
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 sh -c '
echo "Task number $1"
create input file
run ./java
' mytask
If your system doesn't have seq, you can use the bash snippet
for ((i=1; i<=200000; i++)); do echo "$i"; done
or other shell tools such as
awk '{for (i=1; i<=200000; i++) print i}' </dev/null
or
</dev/zero tr '\0' '\n' | head -n 200000 | nl
Set up 8 subprocesses that read from a common stream; each subprocess reads one line of input and starts a new job whenever its current job completes.
forker () {
while read; do
# create input file
./java
done
}
cores=8 # or 16, or whatever
for ((i=1; i<=200000; i++)); do
echo $i
done | {
for ((j=0; j<cores; j++)); do
forker &
done
wait # Wait for the $cores forkers to complete
}
