Excluding lists of hosts from SGE job submission - bash

I am using a cluster running SGE 8.1.9. Some nodes on the server are broken and some are working. I have a list of node host-names which are working OK, so I want to submit my array job to those nodes only.
I have successfully submitted jobs to a single node which works:
qsub -t 5:18 -l h=nodeA myScript.sh
However, I want to submit my jobs to a list of working nodes, e.g.:
qsub -t 5:18 -l h=nodeA,nodeB,nodeC myScript.sh
But this throws:
Unable to run job: unknown resource "nodeB"
Exiting.
What is the correct syntax for submitting an array job to a list of nodes when you have their hostnames?

I figured out that you can exclude hosts by including the flag:
#$ -l h=!(nodeA|nodeB)
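The same host-expression syntax should also cover the inclusion case from the question. A hedged sketch (the node names are the question's placeholders; quote the expression on the command line so the shell does not interpret ! and |):
# Exclude the known-broken hosts:
qsub -t 5-18 -l h='!(nodeA|nodeB)' myScript.sh
# Or restrict the array to the known-good hosts:
qsub -t 5-18 -l h='nodeA|nodeB|nodeC' myScript.sh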

Related

How to adjust bash file to execute on a single node

I would like to know whether it is possible (and if so, how) to adjust the bash file below.
I have a principal Matlab script main.m, which in turn calls another Matlab script f.m.
f.m should be executed many times with different inputs.
I structure this as an array job.
I typically use the following bash file, called td.sh, to run the array job on my university's HPC:
#$ -S /bin/bash
#$ -l h_vmem=5G
#$ -l tmem=5G
#$ -l h_rt=480:0:0
#$ -cwd
#$ -j y
#Run 237 tasks where each task has a different $SGE_TASK_ID ranging from 1 to 237
#$ -t 1-237
#$ -N mod
date
hostname
#Output the Task ID
echo "Task ID is $SGE_TASK_ID"
/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID = $SGE_TASK_ID; f; exit"
What I do in the terminal is:
cd to the folder where the scripts main.m, f.m, and td.sh are located
run qsub td.sh
Question: I need to change the bash file above because the script f.m calls a solver (Gurobi) whose license is single-node, single-user. This is what I have been told:
" This license has been installed already and works only on node A.
You will not be able to qsub your scripts as the jobs have to run on this node.
Instead you should ssh into node A and run the job on this node directly instead
of submitting to the scheduler. "
Could you guide me through how I should change the bash file above? In particular, how should I force execution onto node A?
Even though I am restricted to a single node, am I still able to parallelise using array jobs? Or are array jobs by definition executed on multiple nodes?
If you cannot use your scheduler, then you cannot use its array jobs; you will have to find another way to parallelize the work. Array jobs are not, by definition, executed on multiple nodes (although in practice they usually are, depending on resource availability).
Regarding the adaptation of your script, just follow the guidelines provided by your sysadmins: forget about SGE and launch your computations via ssh directly against the node you were told to use:
#!/bin/bash
date
hostname
for TASK_ID in {1..237}
do
    # Output the task ID
    echo "Task ID is $TASK_ID"
    ssh user@A "/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r \"main; ID = $TASK_ID; f; exit\""
done
If the license is single-node and single-user (but allows multiple simultaneous executions), you can try to parallelize the work. You will have to take into account the resources available on node A (number of CPUs, memory, and so on) and the resources each run needs, and then start as many runs simultaneously as possible without overloading the node (otherwise they will take longer or even fail); see the sketch below.
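A hedged sketch of that idea, using plain bash job control to cap concurrency (MAX_JOBS=4 is an assumed value; tune it to node A's CPUs and memory):
#!/bin/bash
MAX_JOBS=4   # assumed cap; adjust to node A's resources
for TASK_ID in {1..237}
do
    # Wait for a free slot before launching the next task
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]
    do
        sleep 5
    done
    ssh user@A "/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r \"main; ID = $TASK_ID; f; exit\"" &
done
wait   # block until every background run has finished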

Array job with unknown task number

I would like to submit an array job on a cluster running SGE.
I know how to use array jobs with the -t option (for instance, qsub -t 1-1000 somescript.sh).
What if I don't know how many tasks I have to submit? The idea would be to use something like (not working):
qsub -t 1- somescript.sh
The submission would then go for all the n tasks, with unknown n.
No, open-ended arrays are not a built-in capability (nor can you add jobs to an array after initial submission).
I'm guessing about why you want to do this, but here's one idea for keeping track of a group of jobs like this: specify a shared name for the set of jobs, appending a counter.
So, for example, you'd include -N myjob.<counter> in your qsub invocation (or add a #$ directive for it in the script):
-N myjob.1
-N myjob.2
...
-N myjob.n
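A minimal sketch of that naming scheme, assuming one job per file in a hypothetical inputs/ directory (and that somescript.sh accepts the file as an argument):
i=1
for f in inputs/*
do
    # A shared base name plus a counter lets the whole set be tracked
    # together later, e.g. with: qstat | grep myjob
    qsub -N "myjob.$i" somescript.sh "$f"
    i=$((i+1))
done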

How to see the output of a job submitted through qsub in my terminal?

I am submitting this simple job to SGE through qsub. How can I see the output of the job (a simple echo) directly in my terminal? I mean I want it on screen, not diverted to a logfile or anything like that.
So here is the job stored in Dummyjob:
#!/bin/sh
#$ -j y
#$ -S /bin/sh
#$ -q long.q
sleep 30
echo "I'm done!"
And this is the qsub command:
qsub -N job_1 -cwd ./Dummyjob
Thank you!
qsub doesn't do that: it is a batch facility (see How to submit a job using qsub).
Looking at the command-line options, these are the possibilities:
-o <output_logfile> name of the output log file
-e <error_logfile> name of the error log file
-m ea Will send email when job ends or aborts
You can ask it to send mail when the job is done (successfully or not). Or you might be able to make it write to a fifo, e.g., in one terminal you would do
mkfifo myFakeFile
tail -f myFakeFile
and then use
-o myFakeFile
when submitting (in that order, so that something is waiting). But if the program does any checking, it will not write to a fifo (because it is not a regular file).
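For programs that do write happily to the fifo, the whole sequence looks like this (myFakeFile is just an example name):
mkfifo myFakeFile
tail -f myFakeFile &    # the reader has to be waiting before the job writes
qsub -N job_1 -cwd -j y -o myFakeFile ./Dummyjob
# Output appears in this terminal once the job runs; afterwards clean up:
# kill %1 && rm myFakeFile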
Further reading:
qsub - submit a batch job to Sun Grid Engine.
6.3.2 Creating a FIFO (The Linux Programmer's Guide)
The previous answer mentions that you are submitting a 'batch job script', and this is true, so you will not see the output on your terminal (tty): stdout/stderr are sent to output files. However, that doesn't mean you can't run an interactive job through Grid Engine. You can: just use qrsh instead of qsub, and the script will be run on a remote machine chosen by Grid Engine with the results displayed on your screen.
Note: You might have to configure qrsh in your Grid Engine Cluster for this to work.
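A hedged one-liner, assuming qrsh is configured and the script sits on a filesystem the remote node can see:
qrsh -q long.q ./Dummyjob    # runs on a node chosen by Grid Engine; output streams to this terminal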

Command in .bashrc file cannot be executed correctly when submitting a PBS job

I have a script to submit a job in bash shell, which looks like
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
#PBS -N xxxxx
However, after I submitted my job, I got an error message in xxxxx.e8980 file as follows:
/home/xxxxx/.bashrc: line 1: /etc/ini.modules: No such file or directory
but the file /etc/ini.modules is there. Why can't the system find it?
Thank you very much!
When referencing files in a job that will be submitted to a cluster, you must either force the job to the specific node(s) that have the file or make sure the file is present on all compute nodes in the cluster.
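A common defensive workaround (a sketch, not from the thread itself) is to guard the source line in ~/.bashrc so it is skipped on nodes where the file is absent:
# line 1 of ~/.bashrc: only source the file where it actually exists
[ -f /etc/ini.modules ] && source /etc/ini.modules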

How to submit a job to a specific node in PBS

How do I send a job to a specific node in PBS/TORQUE?
I think you must specify the node name after nodes.
#PBS -l nodes=abc
However, this doesn't seem to work and I'm not sure why.
A similar question was asked here: PBS and specify nodes to use.
Here is my sample code
#!/bin/bash
#PBS nodes=node9,ppn=1,
hostname
date
echo "This is a script"
sleep 20 # run for a while so I can look at the details
date
Also, how do I check which node the job is running on? I saw somewhere that $PBS_NODEFILE shows the details, but it doesn't seem to work for me.
You can do it like this:
#PBS -l nodes=<node_name>
You can also specify the number of processors:
#PBS -l nodes=<node_name>:ppn=X
Or you can request additional nodes, specified or unspecified:
#PBS -l nodes=<node_name1>[:ppn=X][+<node_name2>[:ppn=X]...]
That gives you multiple specific nodes.
#PBS -l nodes=<node_name>[:ppn=X][+Y[:ppn=Z]]
This requests the specific node with X execution slots from that node, plus an additional Y nodes with Z execution slots each.
Edit: To simply request a number of nodes and execution slots per node:
#PBS -l nodes=X:ppn=Y
NOTE: this is all for TORQUE/Moab. It may or may not work for other PBS resource managers/schedulers.
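Since the question also asked how to check which node a job landed on, here is a minimal sketch for TORQUE (node9 is the asker's example name):
#!/bin/bash
#PBS -l nodes=node9:ppn=1
#PBS -N node_pin_test
echo "Running on: $(hostname)"
# $PBS_NODEFILE points at a file listing the hosts allocated to this job
echo "Allocated nodes:"
cat "$PBS_NODEFILE"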
The above answer doesn't work for PBS Pro. The following works for including a list of nodes (node1 and node2).
#PBS -l select=1:host=node1+1:host=node2
To also include the number of processors:
#PBS -l select=1:ncpus=24:host=node1+1:ncpus=24:host=node2
