Holding job on qsub - shell

I need help launching a job with qsub that depends on multiple previous jobs. Say I have submitted the following jobs:
qsub -N job1 -cwd job1_script
qsub -N job2 -cwd job2_script
qsub -N job3 -cwd job3_script
I need to hold a job dependent on all three jobs above.
I know we can do this as:
qsub -hold_jid job1,job2,job3 -cwd job4_script
Is there a way to pass job IDs to -hold_jid as a pattern? Something like this:
qsub -hold_jid job* -cwd job4_script
I will be submitting N jobs and need to hold the last one until all N jobs are complete, hence I am looking for a way to do this.
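For reference, a minimal sketch of one workaround, assuming a Grid Engine that supports the -terse option (the names N, job<i>_script, and job4_script below are placeholders): collect the job IDs at submission time and build the -hold_jid list yourself. Some Grid Engine versions may also accept a quoted wildcard job-name pattern, but an explicit ID list is the safer bet.
#!/bin/bash
# Sketch only: jobN_script files and N are placeholders for your real jobs.
N=3
HOLD_LIST=""
for i in $(seq 1 $N); do
    # -terse makes qsub print only the job ID
    JID=$(qsub -terse -N "job$i" -cwd "job${i}_script")
    HOLD_LIST="${HOLD_LIST:+$HOLD_LIST,}$JID"
done
# job4_script only starts once every collected job ID has finished
qsub -hold_jid "$HOLD_LIST" -cwd job4_script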

Related

Read job name from bash script parameters in SGE

I am using Sun Grid Engine to submit jobs, and I want a single bash script that submits any file I need to run, instead of having to run a different qsub command with a different bash file for each job. I have been able to generate output and error files that share the name of the input file, but now I am struggling to set a different job name for each input file. My approach has been the following:
#!/bin/bash
#
#$ -cwd
#$ -S /bin/bash
#$ -N $1
#
python -u $1 >/output_dir/$1.out 2>/error_dir/$1.error
This way, running qsub send_to_sge.sh foo executes the program, and creates the files foo.error and foo.out with the errors and printouts, respectively. However, the job appears with the name $1 in the SGE queue. Instead, I would like to have foo as the job name. Is there any way to achieve what I am seeking?
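A possible workaround (a sketch, not tested on this setup): the #$ lines are parsed by qsub itself, so $1 is never expanded there, but an -N given on the qsub command line overrides the one embedded in the script. A small wrapper like the hypothetical submit.sh below sets the job name per input file:
#!/bin/bash
# submit.sh (hypothetical name): usage  ./submit.sh foo
# Passes the input file name as the job name, overriding the #$ -N line.
qsub -N "$1" send_to_sge.sh "$1"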

What does the keyword --exclusive mean in slurm?

This is a follow-up question to [How to run jobs in parallel using one slurm batch script?]. The goal was to create a single sbatch script that can start multiple processes and run them in parallel. The answer given by damienfrancois was very detailed and looked something like this:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --partition=All
srun -n 1 -c 1 --exclusive sleep 60 &
srun -n 1 -c 1 --exclusive sleep 60 &
....
wait
However, I am not able to understand the --exclusive keyword. If I use it, one node of the cluster is chosen and all processes are launched there, whereas I would like Slurm to distribute the "sleeps" (i.e. the job steps) over the entire cluster.
So how does the --exclusive keyword work? According to the Slurm documentation, the restriction to one node should not happen, since the keyword is used within a step allocation.
(I am new to Slurm.)
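For context, a rough sketch of how such a script is usually laid out when the steps should be able to spread beyond one node: srun steps can only use the resources already allocated to the job, and --exclusive at the step level merely keeps steps from sharing those CPUs, so the sbatch header has to request enough tasks up front (the task count of 10 below is an arbitrary assumption):
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=All
#SBATCH --ntasks=10          # enough slots for ten concurrent one-CPU steps
#SBATCH --cpus-per-task=1
# --exclusive here only prevents the steps from sharing CPUs within this
# job's allocation; it does not grow the allocation itself.
for i in $(seq 1 10); do
    srun -n 1 -c 1 --exclusive sleep 60 &
done
wait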

Do I need a single bash file for each task in SLURM?

I am trying to launch several tasks in a SLURM-managed cluster, and would like to avoid dealing with dozens of files.
Right now, I have 50 tasks (indexed by i, and for simplicity, i is also the input parameter of my program), and for each one a separate bash file slurm_run_i.sh which contains the computation configuration and the srun command:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -J pltCV
#SBATCH --mem=30G
srun python plotConvergence.py i
I am then using another bash file, slurm_run_all.sh, to submit all these tasks:
#!/bin/bash
for i in {1..50}; do
    sbatch slurm_run_$i.sh
done
This works (50 jobs run on the cluster), but I find it troublesome to maintain more than 50 input files. Searching for a solution, I came up with the & operator, obtaining something like:
#!/bin/bash
#SBATCH --ntasks=50
#SBATCH --cpus-per-task=1
#SBATCH -J pltall
#SBATCH --mem=30G
# Running jobs
srun python plotConvergence.py 1 &
srun python plotConvergence.py 2 &
...
srun python plotConvergence.py 49 &
srun python plotConvergence.py 50 &
wait
echo "All done"
Which seems to run as well. However, I cannot manage each of these jobs independently: the output of squeue shows I have a single job (pltall) running on a single node. As there are only 12 cores on each node in the partition I am working in, I assume most of my jobs are waiting on the single node I've been allocated. Setting the -N option doesn't change anything either. Moreover, I can no longer cancel individual jobs if I realize there's a mistake or something, which sounds problematic to me.
Is my interpretation right, and is there a better way than my attempt to run several jobs in Slurm without getting lost among many files?
What you are looking for is the job array feature of Slurm.
In your case, you would have a single submission file (slurm_run.sh) like this:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -J pltCV
#SBATCH --mem=30G
#SBATCH --array=1-50
srun python plotConvergence.py ${SLURM_ARRAY_TASK_ID}
and then submit the array of jobs with
sbatch slurm_run.sh
You will see that 50 jobs are submitted. You can cancel all of them at once or one by one. See the man page of sbatch for details.
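For example, individual array tasks can be cancelled by job ID plus task index (the job ID below is made up):
scancel 123456      # cancel the whole array
scancel 123456_7    # cancel only array task 7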

Hold remainder of shell script commands until PBS qsub array job completes

I am very new to shell scripting, and I am trying to write a shell pipeline that submits multiple qsub jobs but has several commands to run in between these qsubs, which are contingent on the most recent job completing. I have been researching multiple ways to keep the shell script from proceeding after submission of a qsub job, but none have been successful.
The simplest chunk of code I can provide to illustrate the issue is as follows:
THREADS=`wc -l < list1.txt`
qsub -V -t 1-$THREADS firstjob.sh
echo "firstjob.sh completed"
There are obviously other lines of code after this that are actually contingent on firstjob.sh finishing, but I have omitted them here for clarity. I have tried the following methods of pausing/holding the script:
1) Only using wait, which is supposed to stop the script until all background programs are completed. This pushed right past the wait and printed the echo statement to the terminal while the array job was still running. My guess is this is occurring because once the qsub job is submitted, qsub itself exits and wait thinks the work has completed?
qsub -V -t 1-$THREADS firstjob.sh
wait
echo "firstjob.sh completed"
2) Capturing the submission output in a variable, echoing that variable, and passing the entire job ID to wait to pause. The wait should then last until all elements of the array job have completed. The error message received is shown after the code, within the code block.
job1=$(qsub -V -t 1-$THREADS firstjob.sh)
echo "$job1"
wait $job1
echo "firstjob.sh completed"
####ERROR RECEIVED####
-bash: wait: `4585057[].cluster-name.local': not a pid or valid job spec
3) Using the -sync y option for qsub. This should prevent qsub from returning until the job is complete, acting as an effective pause... or so I had hoped. The error is shown after the commands. For some reason it is not reading the -sync option correctly?
qsub -V -sync y -t 1-$THREADS firstjob.sh
echo "firstjob.sh completed"
####ERROR RECEIVED####
qsub: script file 'y' cannot be loaded - No such file or directory
4) Using a dummy shell script (the dummy just makes an empty file) so that I could use the -W depend=afterok: option of qsub to pause the script. This again pushes right past to the echo statement without any pause, despite the dummy script being submitted. Both jobs get submitted, one right after the other, with no pause.
job1=$(qsub -V -t 1-$THREADS demux.sh)
echo "$job1"
check=$(qsub -V -W depend=afterok:$job1 dummy.sh)
echo "$check"
echo "firstjob.sh completed"
Some further details regarding the script:
Each job submission is an array job.
The pipeline is being run in the terminal using a command resembling the following, so that I may provide it with 3 inputs: source Pipeline.sh -r list1.txt -d /workingDir/ -s list2.txt
I am certain that firstjob.sh has not actually completed running because I can still see its array tasks in the queue when I use showq.
Perhaps there is an easy fix in most of these scenarios, but being new to all this, I am really struggling. I have to use this method in 8-10 places throughout the script, so it is really hindering progress. Would appreciate any assistance. Thanks.
POST EDIT 1
Here is the code contained in firstjob.sh, though I doubt it will help. Everything in here functions as expected and always produces the correct results.
#!/bin/bash
#PBS -S /bin/bash
#PBS -N demux
#PBS -l walltime=72:00:00
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l mem=15gb
module load biotools
cd ${WORKDIR}/rawFQs/
INFILE=`head -$PBS_ARRAYID ${WORKDIR}${RAWFQ} | tail -1`
BASE=`basename "$INFILE" .fq.gz`
zcat $INFILE | fastx_barcode_splitter.pl --bcfile ${WORKDIR}/rawFQs/DemuxLists/${BASE}_sheet4splitter.txt --prefix ${WORKDIR}/fastqs/ --bol --suffix ".fq"
I just tried using -sync y, and that worked for me, so good idea there... Not sure what's different about your setup.
But a couple of other things you could try involve having your main script track the status of the qsub jobs it launches. One idea is to have the main script check the status of the job using qstat and wait until it finishes before proceeding (a sketch follows below).
Alternatively, you could have the first job write to a file as its last step (or, as you suggested, set up a dummy job that waits for the first job to finish). Then in your main script, you can test to see whether that file has been written before going on.
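A rough sketch of the qstat-polling idea, assuming a Torque/PBS-style qstat that returns a non-zero exit status once the job is no longer known (array-job handling and lingering completed-state entries may need extra care on your cluster):
job1=$(qsub -V -t 1-$THREADS firstjob.sh)
echo "submitted $job1"
# Poll until qstat no longer lists the job, then continue with the pipeline.
while qstat "$job1" >/dev/null 2>&1; do
    sleep 60
done
echo "firstjob.sh completed"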

Get SGE jobid to make a pipeline

Suppose I want to write a pipeline of tasks to submit to Sun/Oracle Grid Engine.
qsub -cwd touch a.txt
qsub -cwd -hold_jid touch wc -l a.txt
Now, this will run the 2nd job (wc) only after the first job (touch) is done. However, if a previous job with the name touch had run earlier, the 2nd job won't be held since the condition is already satisfied. I need the jobid of the first job.
I tried
myjid=`qsub -cwd touch a.txt`
But that captures the whole submission message rather than just the job ID:
$ echo $myjid
Your job 1062487 ("touch") has been submitted
You just need to add the -terse option to the first qsub so that it only displays the jobid rather than the whole string.
JID=`qsub -terse -cwd touch a.txt`
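The captured ID can then be fed straight into -hold_jid (continuing the example above):
JID=`qsub -terse -cwd touch a.txt`
qsub -cwd -hold_jid $JID wc -l a.txt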
