Bash jobs: how can I output job process information along with errors?

I'm using a bash script to import some data into a MySQL database. I'm parallelising the loads using jobs in...
for i in *.sql; do
{
echo "importing $i"
mysql --defaults-extra-file=my.cnf < "$i"
echo "$i load complete"
} &
done
If one of the jobs fails, I see the error in the output, but nothing identifies which job caused it. Is there a way to add the job process information to the error output so I know which one it was?

You can try using GNU Parallel:
parallel --progress --tag 'mysql --defaults-extra-file=my.cnf < {}' ::: *.sql > import.log
This will run the mysql --defaults-extra-file=my.cnf as many times as there are .sql files in the current directory, in parallel. You can decide on the maximum number of parallel processes with the --jobs option.
The --tag option prefixes each output line with the file being processed. The output is redirected to a file for later reference, and the --progress option reports progress while the jobs run (use --bar instead if you prefer a progress bar).
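If GNU Parallel is not available, the same idea (prefixing every output line with the file it came from) can be approximated in plain bash. The following is only a sketch: load_file is a hypothetical stand-in for the mysql call, and the two file names are made up so that the example runs anywhere.

```shell
#!/bin/bash
# Hypothetical stand-in for: mysql --defaults-extra-file=my.cnf < "$1"
load_file() {
    if [ "$1" = "bad.sql" ]; then
        echo "simulated import error" >&2
        return 1
    fi
    echo "ok"
}

import_log=$(
    for i in good.sql bad.sql; do
        # Merge stderr into stdout, then prefix every line with the file name.
        { load_file "$i" 2>&1; } | sed "s|^|[$i] |" &
    done
    wait    # block until every background pipeline has finished
)
printf '%s\n' "$import_log"
```

The order of the lines depends on which job finishes first. File names containing characters that are special in a sed replacement (such as & or a backslash) would need escaping.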

Related

How to run an Inotify shell script as an asynchronous process

I have an inotify shell script which monitors a directory and executes certain commands when a new file comes in. I need to turn this inotify script into a parallelized process, so that it doesn't wait for the processing to complete whenever multiple files come into the directory.
I have tried using nohup, & and xargs to achieve this. The problem is that xargs runs the same script as a number of processes, and whenever a new file comes in, all n running processes try to handle it. I only want one of the processes, whichever is idle, to handle the new file. Something like a worker pool, where whichever worker is free picks up the task.
This is my shell script.
#!/bin/bash
# script.sh
inotifywait --monitor -r -e close_write --format '%w%f' ./ | while read FILE
do
echo "started script";
sleep $(( $RANDOM % 10 ))s;
#some more process which takes time when a new file comes in
done
I did try to execute the script like this with xargs =>
xargs -n1 -P3 bash sample.sh
So whenever a new file comes in, it gets processed three times because of -P3, but ideally I want only one of the processes, whichever is idle, to pick up the task.
Please shed some light on how to approach this problem?
There is no reason to have a pool of idle processes. Just run one per new file when you see new files appear.
#!/bin/bash
inotifywait --monitor -r -e close_write --format '%w%f' ./ |
while read -r file
do
echo "started script";
( sleep $(( $RANDOM % 10 ))s
#some more process which takes time when a new "$file" comes in
) &
done
Notice the addition of & and the parentheses to group the sleep and the subsequent processing into a single subshell which we can then background.
Also, notice that we prefer read -r and a lowercase variable name; see "Correct Bash and shell script variable capitalization".
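If an unbounded number of background jobs is a concern, the loop can be capped without a pool of idle workers. This is a minimal sketch assuming bash 4.3 or later (for wait -n); process_file, max_jobs and the file names are hypothetical, and printf stands in for the inotifywait pipeline so the example runs without inotify-tools.

```shell
#!/bin/bash
# Hypothetical stand-in for the slow per-file work done in the loop above.
process_file() {
    sleep 0.2
    echo "done $1" >> "$results"
}

max_jobs=2              # assumed cap on concurrent workers
results=$(mktemp)

# printf stands in for the inotifywait pipeline that feeds file names.
while read -r file; do
    # If the cap is reached, block until any one background job exits.
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n
    done
    process_file "$file" &
done < <(printf '%s\n' a.txt b.txt c.txt d.txt)
wait                    # let the last workers finish
cat "$results"
```

Feeding the loop through process substitution rather than a pipe keeps it in the main shell, so the final wait sees the background jobs.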
Maybe this will work:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-GNU-Parallel-as-dir-processor
If you have a dir in which users drop files that needs to be processed you can do this on GNU/Linux (If you know what inotifywait is called on other platforms file a bug report):
inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |
parallel -u echo
This will run the command echo on each file put into my_dir or subdirs of my_dir.
To run at most 5 processes use -j5.

question on using bwait to wait for multiple bsub jobs to finish

I am new to using LSF (been using PBS/Torque all along).
I need to write code/logic to make sure all bsub jobs finish before other commands/jobs can be fired.
Here is what I have done: I have a master shell script which calls multiple other shell scripts via bsub commands. I capture the job ids from bsub in a log file and I need to ensure that all jobs get finished before the downstream shell script should execute its other commands.
Master shell script
#!/bin/bash
...Code not shown for brevity..
"Command 1 invoked with multiple bsubs" > log_cmd_1.txt
Need Code logic to use bwait before downstream Commands can be used
"Command 2 will be invoked with multiple bsubs" > log_cmd_2.txt
and so on
stdout captured from Command 1 within the Master Shell script is stored in log_cmd_1.txt which looks like this
Submitting Sample 101
Job <545> is submitted to .
Submitting Sample 102
Job <546> is submitted to .
Submitting Sample 103
Job <547> is submitted to .
Submitting Sample 104
Job <548> is submitted to .
I have used the codeblock shown below after Command 1 in the master shell script.
However, it does not seem to work for my situation. Looks like I would have gotten the whole thing wrong below.
while sleep 30m;
do
#the below gets the JobId from the log_cmd_1.txt and tries bwait
grep '^Job' <path_to>/log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r line; do res=$(bwait -w "done($line)");echo $res; done 1> <path_to>/running.txt;
# the below sed command deletes blank and whitespace-only lines
sed '/^\s*$/d' running.txt > running2.txt;
# -s file check operator means "file is not zero size"
if [ -s $WORK_DIR/logs/running2.txt ]
then
echo "Jobs still running";
else
echo "Jobs complete";
break;
fi
done
The question: What's the correct way to do this using bwait within the master shell script.
Thanks in advance.
bwait will block until the condition is satisfied, so the loops are probably not necessary. Note that since you're using done, if a job fails then bwait will exit and inform you that the condition can never be satisfied. Make sure to check for that case.
What you have should work. At least the following test worked for me.
#!/bin/bash
# "Command 1 invoked with multiple bsubs" > log_cmd_1.txt
( bsub sleep 0; bsub sleep 0 ) > log_cmd_1.txt
# Need Code logic to use bwait before downstream Commands can be used
while sleep 1
do
#the below gets the JobId from the log_cmd_1.txt and tries bwait
grep '^Job' log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r line; do res=$(bwait -w "done($line)");echo "$res"; done 1> running.txt;
# the below sed command deletes blank and whitespace-only lines
sed '/^\s*$/d' running.txt > running2.txt;
# -s file check operator means "file is not zero size"
if [ -s running2.txt ]
then
echo "Jobs still running";
else
echo "Jobs complete";
break;
fi
done
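Because bwait blocks, the polling loop can in principle be collapsed into one call whose exit status is checked. The sketch below assumes that bwait accepts several done() terms joined with && in one -w expression (check the limits of your LSF version); the sample log reproduces the format from the question, and bwait is only invoked where it is installed.

```shell
#!/bin/bash
# Sample log in the format shown in the question (queue name elided as in the source).
log=$(mktemp)
cat > "$log" <<'EOF'
Submitting Sample 101
Job <545> is submitted to .
Submitting Sample 102
Job <546> is submitted to .
EOF

# Pull the numeric ID out of every "Job <N> ..." line.
job_ids=$(sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p' "$log")

# Join the IDs into one expression: done(545) && done(546)
condition=""
for id in $job_ids; do
    condition="${condition:+$condition && }done($id)"
done
echo "$condition"

# Only call bwait where LSF is actually installed.
if command -v bwait >/dev/null 2>&1; then
    if ! bwait -w "$condition"; then
        echo "at least one job did not finish successfully" >&2
        exit 1
    fi
fi
```

A non-zero status from bwait covers the case mentioned above, where a failed job means the done() condition can never be satisfied.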
Another way to do it, which may be a little cleaner, is to use job arrays and job dependencies. A job array combines several pieces of work so that they can be managed as a single job. So your
"Command 1 invoked with multiple bsubs" > log_cmd_1.txt
could be submitted as a single job array. You'll need a driver script that can launch the individual jobs. Here's an example driver script.
$ cat runbatch1.sh
#!/bin/bash
# $LSB_JOBINDEX goes from 1 to 10
if [ "$LSB_JOBINDEX" -eq 1 ]; then
# do the work for job batch 1, job 1
...
elif [ "$LSB_JOBINDEX" -eq 2 ]; then
# etc
...
fi
Then you can submit the job array like this.
bsub -J 'batch1[1-10]' sh runbatch1.sh
This command will run 10 job array elements. The driver script's environment will contain the variable LSB_JOBINDEX, which tells you which element the driver is running. Since the array has a name, batch1, it's easier to manage. You can submit a second job array that won't start until all elements of the first have completed successfully. The second array is submitted with this command.
bsub -w 'done(batch1)' -J 'batch2[1-10]' sh runbatch2.sh
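Instead of one if/elif branch per element, a driver script could also use the index to pick a line from a list. This is a rough sketch only: the sample list and its contents are hypothetical, and the default value for LSB_JOBINDEX exists purely so the snippet can be tried outside LSF.

```shell
#!/bin/bash
# Hypothetical list with one sample ID per line.
samples=$(mktemp)
printf '%s\n' 101 102 103 104 > "$samples"

# LSF sets LSB_JOBINDEX for each array element; default to 3 for a local dry run.
index=${LSB_JOBINDEX:-3}

# Print only line number $index of the list.
sample=$(sed -n "${index}p" "$samples")
echo "element $index handles sample $sample"
```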
I hope that this helps.

BASH - transfer large files and process after transfer limiting the number of processes

I have several large files that I need to transfer to a local machine and process. The transfer takes about as long as the processing of the file, and I would like to start processing it immediately after it transfers. But the processing could take longer than the transfer, and I don't want the processes to keep building up, but I would like to limit it to some number, say 4.
Consider the following:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
scp user@host:$FILE ./
myCommand $FILE &
done
This will transfer each file and start processing it after the transfer while allowing the next file to start transferring. However, if myCommand $FILE takes much longer than the time to transfer one file, these could keep piling up and bogging down the local machine. So I would like to limit myCommand to maybe 2-4 parallel instances. Subsequent attempts to invoke myCommand should be buffered until a "slot" is open. Is there a good way to do this in Bash? (Using xargs or other utilities is acceptable.)
UPDATE:
Thanks for the help in getting this far. Now I'm trying to implement the following logic:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
echo "Starting on $FILE" # should go to terminal output
scp user@host:$FILE ./
echo "Processing $FILE" # should go to terminal output
echo $FILE # should go through pipe to parallel
done | parallel myCommand
You can use GNU Parallel for that. Just echo the commands you want run into parallel and it will run one job per CPU core your machine has.
for f in ... ; do
scp ...
echo ./process "$f"
done | parallel
If you specifically want 4 processes at a time, use parallel -j 4.
If you want a progress bar, use parallel --bar.
Alternatively, echo just the filename with null-termination, and add the processing command into the invocation of parallel:
for f in ... ; do
scp ...
printf "%s\0" "$f"
done | parallel -0 -j4 ./process
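The same producer/consumer shape also works with xargs, which the question allows. In this sketch the status messages go to stderr so they reach the terminal while only the file names travel through the pipe. fetch and the inline sh -c command are hypothetical stand-ins for scp and myCommand, and -0 and -P are extensions found in GNU and BSD xargs rather than POSIX options.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Hypothetical stand-in for: scp user@host:"$1" ./
fetch() { : > "$workdir/$1"; }

for f in file1 file2 file3; do
    echo "Starting on $f" >&2          # stderr: shown on the terminal
    fetch "$f"
    echo "Processing $f" >&2
    printf '%s\0' "$workdir/$f"        # stdout: goes through the pipe
done | xargs -0 -n 1 -P 4 sh -c 'echo processed > "$1.out"' _

ls "$workdir"
```

With -n 1 each name is handed to a worker as it arrives, and -P 4 keeps at most four workers running at a time.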

xargs parallel - capturing exit code

I have a shell script that parses a flatfile and for each line in it, executes a hive script in parallel.
xargs -P 5 -d $'\n' -n 1 bash -c '
IFS=$(printf "\t") read -r arg1 arg2 arg3 <<<"$1"
eval "hive -hiveconf tableName=$arg1 -f ../hive/LoadTables.hql" 2> ../path/LogFile-$arg1
' _ < ../path/TableNames.txt
Question is how can I capture the exit codes from each parallel process, so even if one child process fails, exit the script at the end with the error code.
Unfortunately I can't use gnu parallel.
I suppose that you look for something fancier, but a simple solution is to store possible errors in a tmp file and look it up afterwards:
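FilewithErrors=/tmp/errors.txt
export FilewithErrors      # the bash -c children need to see this variable
rm -f "$FilewithErrors"    # clear leftovers from a previous run
FinalError=0
xargs -P 5 -d $'\n' -n 1 bash -c '
IFS=$(printf "\t") read -r arg1 arg2 arg3 <<<"$1"
# record the table name whenever hive exits with a non-zero status
hive -hiveconf tableName="$arg1" -f ../hive/LoadTables.hql 2> "../path/LogFile-$arg1" || echo "$arg1" >> "$FilewithErrors"
' _ < ../path/TableNames.txt
if [ -e "$FilewithErrors" ]; then FinalError=1; fi
rm -f "$FilewithErrors"
exit "$FinalError"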
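It may also be enough to look at the exit status of xargs itself, since it reports a non-zero status when any invocation fails (GNU xargs documents 123 for invocations that exit with a status from 1 to 125; other implementations may use a different value). A small sketch, with a trivial test standing in for the hive call:

```shell
#!/bin/bash
# The inline sh -c test is a hypothetical stand-in for the hive invocation:
# it fails only for the input line "fail".
printf '%s\n' ok fail ok | xargs -P 3 -n 1 sh -c '[ "$1" != fail ]' _
status=$?

FinalError=0
if [ "$status" -ne 0 ]; then
    FinalError=1
fi
echo "xargs exit status: $status, FinalError: $FinalError"
```

This only says that something failed, not which input caused it, so the per-table log files are still needed to find the culprit.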
As per the comments: Use GNU Parallel installed as a personal or minimal installation as described in http://git.savannah.gnu.org/cgit/parallel.git/tree/README
From man parallel
EXIT STATUS
Exit status depends on --halt-on-error if one of these are used: success=X,
success=Y%, fail=Y%.
0 All jobs ran without error. If success=X is used: X jobs ran without
error. If success=Y% is used: Y% of the jobs ran without error.
1-100 Some of the jobs failed. The exit status gives the number of failed jobs.
If Y% is used the exit status is the percentage of jobs that failed.
101 More than 100 jobs failed.
255 Other error.
If you need the exact error code (and not just whether the job failed or not) use: --joblog mylog.
You can probably do something like:
cat ../path/TableNames.txt |
parallel --colsep '\t' --halt now,fail=1 hive -hiveconf tableName={1} -f ../hive/LoadTables.hql '2>' ../path/LogFile-{1}
fail=1 will stop spawning new jobs if one job fails, and exit with the exit code from the job.
now will kill the remaining jobs. If you want the remaining jobs to exit of "natural causes", use soon instead.

Hold remainder of shell script commands until PBS qsub array job completes

I am very new to shell scripting, and I am trying to write a shell pipeline that submits multiple qsub jobs, but has several commands to run in between these qsubs, which are contingent on the most recent job completing. I have been researching multiple ways to try and hold the shell script from proceeding after submission of a qsub job, but none have been successful.
The simplest chunk of code I can provide to illustrate the issue is as follows:
THREADS=`wc -l < list1.txt`
qsub -V -t 1-$THREADS firstjob.sh
echo "firstjob.sh completed"
There are obviously other lines of code after this that are actually contingent on firstjob.sh finishing, but I have omitted them here for clarity. I have tried the following methods of pausing/holding the script:
1) Only using wait, which is supposed to stop the script until all background programs are completed. This pushed right past the wait and printed the echo statement to the terminal while the array job was still running. My guess is this is occurring because once the qsub job is submitted, it exits and wait thinks it has completed?
qsub -V -t 1-$THREADS firstjob.sh
wait
echo "firstjob.sh completed"
2) Setting the job to a variable, echoing that variable to submit the job, and using the entire job ID along with wait to pause. The echo command should wait until all elements of the array job have completed. The error message is shown following the code, within the code block.
job1=$(qsub -V -t 1-$THREADS firstjob.sh)
echo "$job1"
wait $job1
echo "firstjob.sh completed"
####ERROR RECEIVED####
-bash: wait: `4585057[].cluster-name.local': not a pid or valid job spec
3) Using the -sync y for qsub. This should prevent it from exiting the qsub until the job is complete, acting as an effective pause...I had hoped. Error in comment after the commands. For some reason it is not reading the -sync option correctly?
qsub -V -sync y -t 1-$THREADS firstjob.sh
echo "firstjob.sh completed"
####ERROR RECEIVED####
qsub: script file 'y' cannot be loaded - No such file or directory
4) Using a dummy shell script (the dummy just makes an empty file) so that I could use the -W depend=afterok: option of qsub to pause the script. This again pushes right past to the echo statement without any pause for submitting the dummy script. Both jobs get submitted, one right after the other, no pause.
job1=$(qsub -V -t 1-$THREADS demux.sh)
echo "$job1"
check=$(qsub -V -W depend=afterok:$job1 dummy.sh)
echo "$check"
echo "firstjob.sh completed"
Some further details regarding the script:
Each job submission is an array job.
The pipeline is being run in the terminal using a command resembling the following, so that I may provide it with 3 inputs: source Pipeline.sh -r list1.txt -d /workingDir/ -s list2.txt
I am certain that the firstjob.sh has not actually completed running because I see them in the queue when I use showq.
Perhaps there is an easy fix in most of these scenarios, but being new to all this, I am really struggling. I have to use this method in 8-10 places throughout the script, so it is really hindering progress. Would appreciate any assistance. Thanks.
POST EDIT 1
Here is the code contained in firstjob.sh...though doubtful that it will help. Everything in here functions as expected, always produces the correct results.
#!/bin/bash
#PBS -S /bin/bash
#PBS -N demux
#PBS -l walltime=72:00:00
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l mem=15gb
module load biotools
cd ${WORKDIR}/rawFQs/
INFILE=`head -$PBS_ARRAYID ${WORKDIR}${RAWFQ} | tail -1`
BASE=`basename "$INFILE" .fq.gz`
zcat $INFILE | fastx_barcode_splitter.pl --bcfile ${WORKDIR}/rawFQs/DemuxLists/${BASE}_sheet4splitter.txt --prefix ${WORKDIR}/fastqs/ --bol --suffix ".fq"
I just tried using -sync y, and that worked for me, so good idea there... Not sure what's different about your setup.
But a couple other things you could try involve your main script knowing the status of the qsub jobs you're running. One idea is that you could have your main script check the status of your job using qstat and wait until it finishes before proceeding.
Alternatively, you could have the first job write to a file as its last step (or, as you suggested, set up a dummy job that waits for the first job to finish). Then in your main script, you can test to see whether that file has been written before going on.
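The sentinel-file idea can be sketched as a small polling helper. Everything here is illustrative: wait_for_file, the flag path and the timeout are hypothetical, and the background subshell merely stands in for the last step of firstjob.sh so the example runs without a scheduler.

```shell
#!/bin/bash
# wait_for_file PATH TIMEOUT_SECONDS: poll until PATH exists or time runs out.
wait_for_file() {
    local path=$1 timeout=$2 waited=0
    until [ -e "$path" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

flag="$(mktemp -d)/firstjob.done"     # hypothetical sentinel path
( sleep 1; touch "$flag" ) &          # stands in for the final step of the job

if wait_for_file "$flag" 10; then
    echo "firstjob.sh completed"
else
    echo "timed out waiting for $flag" >&2
fi
```

For an array job, each element would need its own sentinel (for example one file per array index), and the main script would wait until all of them exist.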
