How to submit a job to a specific node in PBS - bash

How do I send a job to a specific node in PBS/TORQUE?
I think you must specify the node name after nodes.
#PBS -l nodes=abc
However, this doesn't seem to work and I'm not sure why.
A similar question was asked earlier under "PBS and specify nodes to use".
Here is my sample code
#!/bin/bash
#PBS nodes=node9,ppn=1,
hostname
date
echo "This is a script"
sleep 20 # run for a while so I can look at the details
date
Also, how do I check which node the job is running on? I saw somewhere that $PBS_NODEFILE shows the details, but it doesn't seem to work for me.

You can do it like this:
#PBS -l nodes=<node_name>
You can also specify the number of processors:
#PBS -l nodes=<node_name>:ppn=X
Or you can request additional nodes, specified or unspecified:
#PBS -l nodes=<node_name1>[:ppn=X][+<node_name2>...]
That gives you multiple specific nodes.
#PBS -l nodes=<node_name>[:ppn=X][+Y[:ppn=Z]]
This requests the specific node with X execution slots from that node, plus an additional Y nodes with Z execution slots each.
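As a rough illustration of how these pieces combine, here is a small sketch that assembles a nodes= string from HOST[:PPN] arguments. The function name build_nodes_spec and the node names are made up for the example; it only builds the string and does not talk to TORQUE.

```shell
#!/bin/bash
# build_nodes_spec HOST[:PPN] ...
# Hypothetical helper: joins its arguments with "+" and expands
# "name:N" into "name:ppn=N", following the TORQUE syntax shown above.
build_nodes_spec() {
    local spec="" item
    for item in "$@"; do
        case "$item" in
            *:*) item="${item%%:*}:ppn=${item#*:}" ;;
        esac
        spec="${spec:+$spec+}$item"
    done
    printf 'nodes=%s\n' "$spec"
}

# Two named nodes with 4 slots each, plus 2 unspecified nodes with 8 slots each.
build_nodes_spec node1:4 node2:4 2:8
```

The result could then be passed on the command line, for example qsub -l "$(build_nodes_spec node1:4 node2:4)" job.sh, where job.sh stands for your own script.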
Edit: To simply request a number of nodes and execution slots per node:
#PBS -l nodes=X:ppn=Y
NOTE: this is all for TORQUE/Moab. It may or may not work for other PBS resource managers/schedulers.
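Applied to the sample script in the question, the directive needs the -l flag and a colon between the node name and ppn. For the second part of the question, a minimal sketch of reporting the allocated node from inside the job could look like this. The function name report_nodes is invented for the example, and the fallback to hostname is there only so the script also runs outside a job, where $PBS_NODEFILE is not set.

```shell
#!/bin/bash
#PBS -l nodes=node9:ppn=1
# node9 is the node name used in the question.

# report_nodes (hypothetical helper): print each host allocated to the job.
# Inside a job, $PBS_NODEFILE names a file with one line per execution slot,
# so duplicate lines are collapsed with sort -u.
report_nodes() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        sort -u "$PBS_NODEFILE"
    else
        # Not running under PBS: fall back to the local host name.
        hostname
    fi
}

report_nodes
date
echo "This is a script"
```

$PBS_NODEFILE only exists in the environment of the running job, so echoing it in a login shell shows nothing. From the login node, qstat -n <jobid> should also list the nodes allocated to a job under TORQUE.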

The above answer doesn't work for PBS Pro. The following works for including a list of nodes (node1 and node2).
#PBS -l select=1:host=node1+1:host=node2
For also including the number of processors,
#PBS -l select=1:ncpus=24:host=node1+1:ncpus=24:host=node2
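If the host list is long, the select string can be generated rather than typed by hand. This is only a sketch: build_select is a made-up helper name, and it assumes every host gets the same ncpus value.

```shell
#!/bin/bash
# build_select NCPUS HOST ...
# Hypothetical helper: emits one "1:ncpus=N:host=H" chunk per host,
# joined with "+", in the PBS Pro form shown above.
build_select() {
    local ncpus="$1" spec="" host
    shift
    for host in "$@"; do
        spec="${spec:+$spec+}1:ncpus=${ncpus}:host=${host}"
    done
    printf 'select=%s\n' "$spec"
}

build_select 24 node1 node2
```

On the command line this could be used as qsub -l "$(build_select 24 node1 node2)" job.sh, where job.sh is a placeholder for your own script.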

Related

Excluding lists of hosts from SGE job submission

I am using a cluster running SGE 8.1.9. Some nodes on the server are broken and some are working. I have a list of node host-names which are working OK, so I want to submit my array job to those nodes only.
I have successfully submitted jobs to a single node which works:
qsub -t 5:18 -l h=nodeA myScipt.sh
However, I want to submit my jobs to a list of working nodes, e.g.:
qsub -t 5:18 -l h=nodeA,nodeB,nodeC myScipt.sh
But this throws:
Unable to run job: unknown resource "nodeB"
Exiting.
What is the correct syntax to submit your array job to a list of nodes if you have their hostnames?
I figured out that you can exclude the broken hosts by including this flag in the job script:
#$ -l h=!(nodeA|nodeB)
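For the opposite case (only the listed hosts), my understanding is that the same expression syntax accepts hosts joined with | and without the negation, but treat that as an assumption and check it on your SGE installation. A small sketch that builds either form from a list of host names, with host_request being an invented helper name:

```shell
#!/bin/bash
# host_request include|exclude HOST ...
# Hypothetical helper: joins the host names with "|" and prints an
# h= resource request, negated for "exclude" as in the flag above.
host_request() {
    local mode="$1"
    shift
    local IFS='|'
    local joined="$*"   # "$*" joins arguments with the first character of IFS
    case "$mode" in
        include) printf 'h=%s\n' "$joined" ;;
        exclude) printf 'h=!(%s)\n' "$joined" ;;
        *) echo "usage: host_request include|exclude HOST ..." >&2; return 1 ;;
    esac
}

host_request include nodeA nodeB nodeC
host_request exclude nodeA nodeB
```

When passing the result on the command line, keep it quoted, e.g. qsub -l "$(host_request include nodeA nodeB nodeC)" ..., so the shell does not interpret |, ! or the parentheses.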

How to adjust bash file to execute on a single node

I would like your help to know whether it is possible (and if yes how) to adjust the bash file below.
I have a principal Matlab script main.m, which in turn calls another Matlab script f.m.
f.m should be executed many times with different inputs.
I structure this as an array job.
I typically use the following bash file called td.sh to execute the array job into the HPC of my university
#$ -S /bin/bash
#$ -l h_vmem=5G
#$ -l tmem=5G
#$ -l h_rt=480:0:0
#$ -cwd
#$ -j y
#Run 237 tasks where each task has a different $SGE_TASK_ID ranging from 1 to 237
#$ -t 1-237
#$ -N mod
date
hostname
#Output the Task ID
echo "Task ID is $SGE_TASK_ID"
/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID = $SGE_TASK_ID; f; exit"
What I do in the terminal is
cd to the folder where the scripts main.m, f.m, td.sh are located
type in the terminal qsub td.sh
Question: I need to change the bash file above because the script f.m calls a solver (Gurobi) whose license is single node single user. This is what I have been told:
" This license has been installed already and works only on node A.
You will not be able to qsub your scripts as the jobs have to run on this node.
Instead you should ssh into node A and run the job on this node directly instead
of submitting to the scheduler. "
Could you guide me through understanding how I should change the bash file above? In particular, how should I force the execution into node A?
Even though I am restricted to one node only, am I still able to parallelise using array jobs? Or are array jobs by definition executed on multiple nodes?
If you cannot use your scheduler, then you cannot use its array jobs. You will have to find another way to parallelize those jobs. Array jobs are not executed on multiple nodes by definition (but they are usually executed on multiple nodes due to resource availability).
Regarding the adaptation of your script, just follow the guidelines provided by your sysadmins: forget about SGE and start your calculations through ssh directly on the node you have been given:
date
hostname
for TASK_ID in {1..237}
do
#Output the Task ID
echo "Task ID is $TASK_ID"
ssh user@A "/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r \"main; ID = $TASK_ID; f; exit\""
done
If the license is single node and single user (but allows multiple simultaneous executions), you can try to parallelize the calculations. You will have to take into account the resources available on node A (number of CPUs, memory...) and the resources that each execution needs, and then start as many calculations at once as possible without overloading the node (otherwise they will take longer or even fail).
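One simple way to do that, sketched here under a few assumptions: the script is run on node A itself (after ssh), MAX_PARALLEL=4 is an arbitrary limit you would tune to the node, and run_task is a placeholder that only echoes the task ID. In real use its body would be the matlab command from the loop above.

```shell
#!/bin/bash
# Assumed limit on simultaneous tasks; tune it to the CPUs and memory of node A.
MAX_PARALLEL=${MAX_PARALLEL:-4}

# Placeholder task. Replace the echo with the matlab invocation
# from the script above, using $1 as the task ID.
run_task() {
    echo "Task ID is $1"
}

# run_all FIRST LAST: start tasks in batches of MAX_PARALLEL and
# wait for each batch to finish before starting the next one.
run_all() {
    first=$1
    last=$2
    running=0
    id=$first
    while [ "$id" -le "$last" ]; do
        run_task "$id" &
        running=$((running + 1))
        if [ "$running" -ge "$MAX_PARALLEL" ]; then
            wait            # block until the whole batch has finished
            running=0
        fi
        id=$((id + 1))
    done
    wait                    # wait for the last, possibly partial, batch
}

run_all 1 237
```

Batching is the simplest scheme but not the most efficient, since one slow task holds up the rest of its batch. Output lines from tasks in the same batch may appear in any order.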

Array job with unknown task number

I would like to submit an array job on a cluster running SGE.
I know how to use array jobs with the -t option (for instance, qsub -t 1-1000 somescript.sh).
What if I don't know how many tasks I have to submit? The idea would be to use something like (not working):
qsub -t 1- somescript.sh
The submission would then go for all the n tasks, with unknown n.
No, open-ended arrays are not a built-in capability (nor can you add jobs to an array after initial submission).
I'm guessing about why you want to do this, but here's one idea for keeping track of a group of jobs like this: specify a shared name for the set of jobs, appending a counter.
So, for example, you'd include -N myjob.<counter> in your qsub (or add a #PBS script line for it):
-N myjob.1
-N myjob.2
...
-N myjob.n
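A sketch of what that could look like when the inputs are only known at submission time. The names submit_group and a.dat, b.dat, c.dat are invented for the example, and QSUB defaults to "echo qsub" so that running the sketch only prints the commands instead of submitting anything.

```shell
#!/bin/bash
# Dry run by default: prints the qsub commands. Set QSUB=qsub to really submit.
QSUB=${QSUB:-echo qsub}

# submit_group NAME INPUT ...
# Hypothetical helper: one job per input, named NAME.<counter>.
submit_group() {
    name=$1
    shift
    counter=0
    for input in "$@"; do
        counter=$((counter + 1))
        # $QSUB is left unquoted on purpose so "echo qsub" splits into two words.
        $QSUB -N "$name.$counter" somescript.sh "$input"
    done
}

submit_group myjob a.dat b.dat c.dat
```

Because the loop simply runs over whatever inputs exist, the number of jobs does not need to be known in advance, and the shared name prefix lets you pick the group out of the queue listing.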

Check real time output after qsub a job on cluster

Here is my pbs file:
#!/bin/bash
#PBS -N myJob
#PBS -j oe
#PBS -k o
#PBS -V
#PBS -l nodes=hpg6-15:ppn=12
cd ${PBS_O_WORKDIR}
./mycommand
According to the qsub documentation page, if I include the line #PBS -k o, I should be able to check the output in real time in a file named myJob.oJOBID in my home dir. However, when I check the file with tail -f, cat or more while the job is running, it is empty. The output only shows up in the file after I terminate the job. Is there anything I should check to make the stream flush to the output file in real time?
By default, the files are created on the nodes and copied to your home directory when the job completes. The cluster admin can change this behavior by adding "$spool_as_final_name true" to the config file in the mom_priv directory on each node.
Torque MOM Configuration, parameters
Assuming you are allowed to login to the node running your process (this is allowed by the admin of our cluster for the duration of the job, not sure if this is common or not), then you can have real-time output by
Getting the PID of your process
Browsing through the files that this process has opened with lsof -n -p <PID>, and finding the file whose name "looks like" that of a log. In our cluster the files are
/cm/local/apps/pbspro-ce/var/spool/spool/[JOBID][server].OU
/cm/local/apps/pbspro-ce/var/spool/spool/[JOBID][server].ER
The .OU is stdout and the .ER is stderr. You can then tail -f to get real time output.
The output of lsof can be pretty long though, so you should try grepping your JOBID, or maybe this bit pbspro-ce/var/spool/.
Curious to know if this can be replicated in clusters other than our own.
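To cut the lsof output down, something like the following filter might help. It is only a sketch: spool_files is a made-up name, and it assumes the spool files end in .OU or .ER and contain the job ID in their name, as they do on our cluster.

```shell
#!/bin/bash
# spool_files JOBID
# Hypothetical helper: reads `lsof -n -p <PID>` output on stdin and prints
# the NAME column (last field) of entries whose path contains JOBID and
# ends in .OU (stdout) or .ER (stderr).
spool_files() {
    awk -v id="$1" 'index($NF, id) && $NF ~ /\.(OU|ER)$/ { print $NF }'
}

# Intended use on the node running the job (not executed here):
#   lsof -n -p <PID> | spool_files <JOBID>
#   tail -f <one of the printed paths>
```

The paths will differ between installations, so the pattern may need adjusting if your spool directory uses other suffixes.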

command in .bashrc file cannot be executed correctly when submitted a pbs job

I have a script to submit a job in bash shell, which looks like
#!/bin/bash
# PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
#PBS -N xxxxx
However, after I submitted my job, I got an error message in xxxxx.e8980 file as follows:
/home/xxxxx/.bashrc: line 1: /etc/ini.modules: No such file or directory
but the file /etc/ini.modules is there. Why can't the system find it?
Thank you very much!
When referencing files in a job that will be submitted to a cluster, you must either force the job to the specific node(s) that have the file or make sure the file is present on all compute nodes in the cluster.
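If the file really is missing on some compute nodes, one defensive option is to guard the line in .bashrc so the job does not print an error there. A minimal sketch, where source_if_readable is an invented helper name and /etc/ini.modules is the path from the error message:

```shell
#!/bin/bash
# source_if_readable FILE
# Hypothetical helper: source FILE only when it exists and is readable
# on the current host; otherwise note on stderr which host lacks it.
source_if_readable() {
    if [ -r "$1" ]; then
        . "$1"
    else
        echo "skipping $1: not readable on $(hostname)" >&2
    fi
}

source_if_readable /etc/ini.modules
```

This only hides the symptom. If the job depends on what that file sets up, the file still has to be present on the node where the job runs.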
