Execute batch jobs sequentially - sbatch

I have a batch of jobs that looks like this:
sbatch --wrap "perl test.pl file1 file2"
sbatch --wrap "perl test.pl file3 file4"
sbatch --wrap "perl test.pl file5 file6"
sbatch --wrap "perl test.pl file7 file8"
and the list goes on until
sbatch --wrap "perl test.pl file49 file50"
How can I run individual jobs sequentially?
Thanks,

You must make each job dependent on the completion of the previous one:
# Submit the first job and grab its ID from "Submitted batch job <id>".
JOBID=$(sbatch --wrap "perl test.pl file1 file2" | awk '{print $4}')
# Submit the remaining pairs; each job starts only after the previous one
# has finished successfully (afterok).
for N in {3..49..2}
do
    JOBID=$(sbatch -d afterok:$JOBID --wrap "perl test.pl file$N file$(($N+1))" | awk '{print $4}')
done
By the way, the singleton approach suggested in a comment is also a good option, and it frees you from the hassle of managing job IDs.
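For reference, a minimal sketch of that singleton approach (the job name chain_pl is an arbitrary label chosen here): all jobs share one name, and --dependency=singleton lets each job start only after every previously submitted job with that name and user has terminated.
# Hedged sketch of the singleton approach: jobs sharing a name run one at a time.
for N in {1..49..2}
do
    sbatch --job-name=chain_pl --dependency=singleton --wrap "perl test.pl file$N file$(($N+1))"
done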

Related

How to run a bash for loop using GNU parallel?

I have a bash loop where I am passing variables to a script, and I want to run these in parallel with GNU parallel:
for FILE_NAME in FILE1 FILE2 FILE3;
do
./SCRIPT -n $FILE_NAME
done
where I want the scripts to run in parallel as follows:
./SCRIPT -n FILE1
./SCRIPT -n FILE2
./SCRIPT -n FILE3
I am trying to use the GNU parallel command because it has been suggested a lot on here, but I am confused about where to put the parallel command if I am passing a variable to the script.
I have tried turning the FILE1 FILE2 FILE3 into a list:
parallel -a $FILE_LIST ./SCRIPT -n $FILE_NAME
and also calling parallel inside the loop:
for FILE_NAME in FILE1 FILE2 FILE3;
do
parallel ./SCRIPT -n $FILE_NAME
done
Do you have any suggestions?
Try it like this:
parallel ./SCRIPT -n {} ::: FILE1 FILE2 FILE3
Or, more succinctly if your files are really named like that:
parallel ./SCRIPT -n {} ::: FILE*
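If you would rather keep the names in a file, as your -a attempt suggests, something along these lines should also work (file_list.txt is a hypothetical file containing one name per line):
# Hedged sketch: read the arguments from a file, one per line, and substitute each for {}.
parallel -a file_list.txt ./SCRIPT -n {}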

Use of a HEREDOC with SLURM sbatch --wrap

I am encountering difficulties using a (Bash) HEREDOC with a SLURM sbatch submission, via --wrap.
I would like the following to work:
SBATCH_PARAMS=('--nodes=1' '--time=24:00:00' '--mem=64000' '--mail-type=ALL')
sbatch "${SBATCH_PARAMS[@]}" --job-name="MWE" -o "MWE.log" --wrap <<EOF
SLURM_CPUS_ON_NODE=\${SLURM_CPUS_ON_NODE:-8}
SLURM_CPUS_PER_TASK=\${SLURM_CPUS_PER_TASK:-\$SLURM_CPUS_ON_NODE}
export OMP_NUM_THREADS=\$SLURM_CPUS_PER_TASK
parallel --joblog "MWE-jobs.log" --resume --resume-failed -k --linebuffer -j \$((\$OMP_NUM_THREADS/4)) --link "MWE.sh {1} {2}" ::: "./"*R1*.fastq.gz ::: "./"*R2*.fastq.gz
EOF
On my current cluster, sbatch returns the below error, refusing to submit this job:
ERROR: option --wrap requires argument
Might anyone know how I can get this to work?
Since --wrap expects a string argument, you can't use a heredoc directly; a heredoc stands in where a command expects a file and it's undesirable to create one.
Use a heredoc with cat, which does expect a file, and use its output as the string that --wrap expects:
SBATCH_PARAMS=('--nodes=1' '--time=24:00:00' '--mem=64000' '--mail-type=ALL')
sbatch "${SBATCH_PARAMS[@]}" --job-name="MWE" -o "MWE.log" --wrap "$(cat << EOF
SLURM_CPUS_ON_NODE=\${SLURM_CPUS_ON_NODE:-8}
SLURM_CPUS_PER_TASK=\${SLURM_CPUS_PER_TASK:-\$SLURM_CPUS_ON_NODE}
export OMP_NUM_THREADS=\$SLURM_CPUS_PER_TASK
parallel --joblog "MWE-jobs.log" --resume --resume-failed -k --linebuffer -j \$((\$OMP_NUM_THREADS/4)) --link "MWE.sh {1} {2}" ::: "./"*R1*.fastq.gz ::: "./"*R2*.fastq.gz
EOF
)"
You can also skip --wrap altogether and just feed the heredoc to sbatch on standard input, provided you add #!/bin/bash at the top of it; see the sketch below.
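A minimal sketch of that approach, reusing the parameters from the question; with a quoted delimiter ('EOF') the body is passed to sbatch verbatim, so the $-escapes from the --wrap version are no longer needed:
# Hedged sketch: sbatch reads the batch script from stdin when no script file is given.
sbatch "${SBATCH_PARAMS[@]}" --job-name="MWE" -o "MWE.log" <<'EOF'
#!/bin/bash
SLURM_CPUS_ON_NODE=${SLURM_CPUS_ON_NODE:-8}
SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-$SLURM_CPUS_ON_NODE}
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
parallel --joblog "MWE-jobs.log" --resume --resume-failed -k --linebuffer -j $((OMP_NUM_THREADS/4)) --link "MWE.sh {1} {2}" ::: "./"*R1*.fastq.gz ::: "./"*R2*.fastq.gz
EOF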
Adapting a related post on assigning a heredoc to a variable, but using cat instead (since I use errexit and want to avoid working around the non-zero exit status of read), I was able to submit my job as follows:
CMD_FOR_SUB=$(cat <<EOF
SLURM_CPUS_ON_NODE=\${SLURM_CPUS_ON_NODE:-8}
SLURM_CPUS_PER_TASK=\${SLURM_CPUS_PER_TASK:-\$SLURM_CPUS_ON_NODE}
export OMP_NUM_THREADS=\$SLURM_CPUS_PER_TASK
parallel --joblog "MWE-jobs.log" --resume --resume-failed -k --linebuffer -j \$((\$OMP_NUM_THREADS/4)) --link "MWE.sh {1} {2}" ::: "./"*R1*.fastq.gz ::: "./"*R2*.fastq.gz
EOF
)
sbatch "${SBATCH_PARAMS[@]}" --job-name="MWE" -o "MWE.log" --wrap "$CMD_FOR_SUB"
While this does appear to work, I would still prefer a solution that allows sbatch to directly accept the HEREDOC.

How to run multiple commands at the same time in bash instead of running one after another?

I need to run more than one command in bash without waiting for the first one to finish before starting the next. The commands inside the bash file can look like this:
#!/bin/bash
perl test.pl -i file1 -o out1
perl test.pl -i file2 -o out2
and so on
All of them should run at the same time on different cores instead of one after another.
Background them with & (and add a wait afterwards if the script needs to block until they all finish):
#!/bin/bash
perl test.pl -i file1 -o out1 &
perl test.pl -i file2 -o out2 &
Or, better yet, use GNU parallel. This will allow you to use multiple CPUs and much more; see the sketch below.
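A hedged sketch of the GNU parallel equivalent for the two example commands, pairing inputs with outputs via --link (file and output names taken from the question):
# Run the perl invocations in parallel, by default one job per CPU core.
# --link pairs the first input list with the second instead of crossing them.
parallel --link perl test.pl -i {1} -o {2} ::: file1 file2 ::: out1 out2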

How to submit a cat command to a cluster using qsub and use the pipe correctly

I want to submit several "cat jobs" on the fly to a cluster using qsub. Currently I'm concatenating several files with cat into a single one, using > output_file at the end of the command.
The problem is that qsub treats the > output_file redirection as part of its own command line, so the job's log ends up there instead of the cat output.
qsub -b y -cwd -q bigmem cmd1
where cmd1 looks like:
cat file1 file2 filen > output_file
As an alternative to dbeer's answer, if your code is disposable, you can use echo:
echo "cat file1 file2 ... filen > outfile" | qsub -cwd <options>
When a job runs via PBS, stdout is redirected to the job's output file, so the way to do this is to write a script:
#!/bin/bash
cat file1 file2 ... filen
You don't need to redirect the output to a file, because the mom daemon will do that for you when setting up the job; you just need to specify the output file you want with -o. For example, if you named the above script script.sh (make sure it is executable), you'd submit:
qsub -b y -q bigmem -o output_file script.sh

Run a hadoop command in a bash script

I need to run a hadoop command in a bash script that goes through a bunch of folders on Amazon S3, writes those folder names into a txt file, and then does further processing. The problem is that when I ran the script, it seems no folder names were written to the txt file. I wonder whether the hadoop command took too long to run and the bash script didn't wait until it finished before going on to the further processing. If so, how can I make bash wait until the hadoop command has finished before moving on?
Here is my code; I tried it both ways, and neither works:
1.
listCmd="hadoop fs -ls s3n://$AWS_ACCESS_KEY:$AWS_SECRET_KEY#$S3_BUCKET/*/*/$mydate | grep s3n | awk -F' ' '{print $6}' | cut -f 4- -d / > $FILE_NAME"
echo -e "listing... $listCmd\n"
eval $listCmd
...other process ...
2.
echo -e "list the folders we want to copy into a file"
hadoop fs -ls s3n://$AWS_ACCESS_KEY:$AWS_SECRET_KEY@$S3_BUCKET/*/*/$mydate | grep s3n | awk -F' ' '{print $6}' | cut -f 4- -d / > $FILE_NAME
... other process ....
Does anyone know what might be wrong? And is it better to use eval, or just run the hadoop command directly as in the second way?
Thanks.
I would prefer eval in this case; it is prettier to append the next command to this one. I would also break listCmd down into parts, so that you know there is nothing wrong at the grep, awk, or cut level.
listCmd="hadoop fs -ls s3n://$AWS_ACCESS_KEY:$AWS_SECRET_KEY#$S3_BUCKET/*/*/$mydate > $raw_File"
gcmd="cat $raw_File | grep s3n | awk -F' ' '{print $6}' | cut -f 4- -d / > $FILE_NAME"
echo "Running $listCmd and other commands after that"
otherCmd="cat $FILE_NAME"
eval "$listCmd";
echo $? # This will print the exit status of the $listCmd
eval "$gcmd" && echo "Finished Listing" && eval "$otherCmd"
otherCmd will only be executed if $gcmd succeeds. If you have too many commands that you need to execute, then this becomes a bit ugly. If you roughly know how long it will take, you can insert a sleep command.
eval "$listCmd"
sleep 1800 # This will sleep 1800 seconds
eval "$otherCmd"
