Run a job on all nodes of Sun Grid Engine cluster, only once - cluster-computing

I want to run a job on all the active nodes of a 64-node Sun Grid Engine cluster, scheduled using qsub. I am currently using an array job for this, but sometimes the program is scheduled multiple times on the same node.
qsub -t 1-64:1 -S /home/user/.local/bin/bash program.sh
Is it possible to schedule only one job per node, on all nodes in parallel?

You could use a parallel environment. Create a parallel environment with:
qconf -ap "parallel_environment_name"
and set "allocation_rule" to 1, which means that all processes will have to reside on different hosts. Then, when submitting your array job, specify the number of nodes you want to use with your parallel environment. In your case:
qsub -t 1-64:1 -pe "parallel_environment_name" 64 -S /home/user/.local/bin/bash program.sh
For more information, check these links: http://linux.die.net/man/5/sge_pe and Configuring a new parallel environment at DanT's Grid Blog (link no longer working; there are copies on the wayback machine and softpanorama).
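For reference, here is a minimal sketch of what such a parallel environment definition might look like (field names as in sge_pe(5); the PE name and slot count are placeholders to adapt to your cluster):

```
pe_name            parallel_environment_name
slots              64
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    1
control_slaves     FALSE
job_is_first_task  TRUE
```

The key line is allocation_rule 1, which places exactly one slot of the job on each host.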

If you have a bash shell, you can run
for host in $(qhost | tail -n +4 | cut -d " " -f 1); do qsub -l hostname=$host program.sh; done
"-l hostname=" specifies on which host to run the job.
The for loop iterates over the host names returned by qhost and submits the job once per node, specifying the host to use.

Related

sge can only run one task in one node

I built SGE from source on a four-node cluster. The operating system is CentOS 7. When I submit several simple tasks to the cluster, only one task runs on one node. What is the problem? Here is my task code:
sleep 60
echo "done"
and this is my cmd to submit the tasks:
DIR=`pwd`
option=""
for((i=0;i<5;i++));do
qsub -q multislots $option -V -cwd -o stdout -e stderr -S /bin/bash $DIR/test.sh
sleep 1
done
When I run qstat -f, it shows: [screenshot omitted]
Given the error message about jobs failing with "can not find an unused add_grp_id", you should check what gid_range is set to in the SGE configuration (both the global configuration and any per-host one). It should be a range of otherwise unused group ids, with at least as many gids as the number of jobs you want on a node.
If that isn't it, try running qalter -w v and qalter -w p on one of the queued jobs to see why they aren't being started.
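As a sketch of how to check this (the qconf call is guarded so it is a no-op on machines without the SGE client tools, and the range value below is a made-up example):

```shell
# Show the configured gid_range, if the SGE client tools are available.
if command -v qconf >/dev/null 2>&1; then
  qconf -sconf | grep gid_range
fi

# Given a range such as "20000-20100" (example value), the number of
# concurrent jobs it allows per host is the number of ids in the range:
range="20000-20100"
lo=${range%-*}
hi=${range#*-}
echo $((hi - lo + 1))   # -> 101
```

If the printed count is smaller than the number of jobs you want on a node, widen gid_range accordingly.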

gnu parallel one job per processor

I am trying to use GNU parallel (version 20160922) to launch a large number of protein docking jobs (using UCSF Dock 6.7). I am running on a high-performance cluster with several dozen nodes, each with 28-40 cores. The system runs CentOS 7.1.1503 and uses Torque for job management.
I am trying to submit each config file in dock.n.d to the dock executable, one per core on the cluster. Here is my PBS file:
#PBS -l walltime=01:00:00
#PBS -N pardock
#PBS -l nodes=1:ppn=28
#PBS -j oe
#PBS -o /home/path/to/pardock.log
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > temp.txt
#f=$(pwd)
ls dock.in.d/*.in | parallel -j 300 --sshloginfile $PBS_NODEFILE "/path/to/local/bin/dock6 -i {} -o {}.out"
This works fine on a single node as written above. But when I scale up to, say, 300 processors (with -l procs=300) across several nodes, I begin to get these errors:
parallel: Warning: ssh to node026 only allows for 99 simultaneous logins.
parallel: Warning: You may raise this by changing /etc/ssh/sshd_config:MaxStartups and MaxSessions on node026.
What I do not understand is why there are so many logins. Each node only has 28-40 cores so, as specified in $PBS_NODEFILE, I would expect there to only be 28-40 SSH logins at any point in time on these nodes.
Am I misunderstanding or misexecuting something here? Please advise what other information I can provide or what direction I should go to get this to work.
UPDATE
So my problem above was the combination of -j 300 and the use of $PBS_NODEFILE, which has a separate entry for each core on each node. In that case it seems I should have used -j 1. But then all the jobs seem to run on a single node.
So my question remains, how to get gnu parallel to balance the jobs between nodes, utilizing all cores, but not creating an excessive number of SSH logins due to multiple jobs per core.
Thank you!
You are asking GNU Parallel to ignore the number of cores and run 300 jobs on each server.
Try instead:
ls dock.in.d/*.in | parallel --sshloginfile $PBS_NODEFILE /path/to/local/bin/dock6 -i {} -o {}.out
This will default to --jobs 100% which is one job per core on all machines.
If you are not allowed to use all cores on the machines, you can prepend X/ to the hosts in --sshloginfile to force X as the number of cores:
28/server1.example.com
20/server2.example.com
16/server3.example.net
This forces GNU Parallel to skip the detection of cores and instead use 28, 20, and 16 respectively. Combined with -j 100%, this controls how many jobs are started on each server.
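One way to build such a file is to collapse the per-core entries of the PBS nodefile into count/host form. A minimal sketch with made-up node names (on a real cluster you would read "$PBS_NODEFILE" instead of the sample file):

```shell
# Hypothetical nodefile contents: PBS writes one line per allocated core.
printf 'node026\nnode026\nnode026\nnode027\nnode027\n' > nodefile.txt

# Collapse duplicate lines into "count/host" entries for --sshloginfile,
# so each host appears once with its core count instead of once per core.
sort nodefile.txt | uniq -c | awk '{print $1 "/" $2}' > sshloginfile.txt
cat sshloginfile.txt
# -> 3/node026
#    2/node027
```

Passing --sshloginfile sshloginfile.txt then lets the count cap how many jobs run on each host, avoiding one ssh login per core.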

sun grid engine qsub to all nodes

I have a master and two nodes. They have SGE installed, and I have a shell script ready on all the nodes as well. Now I want to use qsub to submit the job on all my nodes.
I used:
qsub -V -b n -cwd /root/remotescript.sh
but it seems that only one node is doing the job. I am wondering how I submit jobs to all nodes. What would the command be?
SGE is meant to dispatch jobs to worker nodes. In your example, you create one job, so one node will run it. If you want to run a job on each of your nodes, you need to submit more than one job. If you want to target specific nodes, you should probably use something closer to
qsub -V -b n -cwd -l hostname=node001 /root/remotescript.sh
qsub -V -b n -cwd -l hostname=node002 /root/remotescript.sh
The "-l hostname=*" parameter will require a specific host to run the job.
What are you trying to do? The general use case of using a grid engine is to let the scheduler dispatch the jobs so you don't have to use the "-l hostname=*" parameter. So technically you should just submit a bunch of jobs to SGE and let it dispatch it with the nodes availability.
Finch_Powers' answer describes well how SGE allocates resources, so I'll elaborate below on the specifics of your question, which may be why you are not getting the desired outcome.
You mention launching remote script via:
qsub -V -b n -cwd /root/remotescript.sh
Also, you mention again that these scripts are located on the nodes:
"And I have a shell script ready on all the nodes as well"
This is not how SGE is designed to work, although it can do this. Typical usage is to have the same script (or scripts) accessible to all execution nodes via network-mounted storage, and to let SGE decide which nodes run it.
To run remote code, you may be better served using plain SSH.
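A minimal sketch of that approach, with placeholder host names and an echo left in as a dry run (remove the echo to actually execute the script over SSH):

```shell
# Launch the script that already exists on every node, one host at a time.
# node001/node002 are placeholders for your real hostnames.
for h in node001 node002; do
  echo ssh "$h" /root/remotescript.sh
done
# -> ssh node001 /root/remotescript.sh
#    ssh node002 /root/remotescript.sh
```

This assumes passwordless SSH from the master to each node, which SGE clusters typically already have configured.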

qsub: requesting a job array from within a qsub session

I have a matlab script that processes a large amount of data using torque job arrays.
The server that I SSH into lacks the memory to load the data in the first place, so I need to request the compute node resources as a torque job, as follows:
qsub -I -V -l nodes=1:ppn=1,walltime=12:00:00,vmem=80G
However, when I now run the matlab script I am unable to submit torque job array requests. The error I am getting is as follows:
qsub: submit error (Job rejected by all possible destinations (check syntax, queue resources, ...))
The job array request given was:
qsub -t 1-$1 -l vmem=16G -l nodes=1:ppn=1,walltime=48:00:00 -v batchID=$2,batchDir=$3,funcName=$4 -e $5 -o $6 $HOME/scripts/job.sh
This command works fine outside of a qsub session, and the above error is not transient, so it appears that I cannot submit a request for a torque job array from within a qsub session.
How do I obtain the necessary memory resources from the compute nodes while also being able to submit requests for torque job arrays?
The cluster may not allow you to submit jobs from nodes in the cluster. You can ask the admin to change this behavior, or you can ssh to the head node from within your first job and run the qsub there:
ssh head "qsub -t .........."

MPI not using all CPUs allocated

I am trying to run some code across multiple CPUs using MPI.
I run using:
$ mpirun -np 24 python mycode.py
I'm running on a cluster with 8 nodes, each with 12 CPUs. My 24 slots get allocated across several nodes.
Let's call the nodes node1, node2, ..., node8 and assume that the master process is on node1 and my job is the only one running. So node1 has the master process and a few slave processes, the rest of the nodes have only slave processes.
Only the node with the master process (ie node1) is being used. I can tell because nodes2-8 have load ~0 and node1 has load ~24 (whereas I would expect the load on each node to be approximately equal to the number of CPUs allocated to my job from that node). Also, each time a function is evaluated, I get it to print out the name of the host on which its running, and it prints out "node1" every time. I don't know whether the master process is the only one doing anything or if the slave processes on the same node as the master are also being used.
The cluster I'm running on was recently upgraded. Before the upgrade, I was using the same code and it behaved entirely as expected (i.e. when I asked for 24 CPUs, it gave me 24 CPUs and then used all 24 CPUs). This problem has only arisen since the upgrade, so I assume a setting somewhere got changed or reset. Has anyone seen this problem before and know how I might fix it?
Edit: This is submitted as a job to a scheduler using:
#!/bin/bash
#
#$ -cwd
#$ -pe * 24
#$ -o $JOB_ID.out
#$ -e $JOB_ID.err
#$ -r no
#$ -m n
#$ -l h_rt=24:00:00
echo job_id $JOB_ID
echo hostname $HOSTNAME
mpirun -np $NSLOTS python mycode.py
The cluster is running SGE and I submit this job using:
qsub myjob
It's also possible to specify where you want your jobs to run by using a hostfile. How the hostfile is formatted and used varies by MPI implementation so you'll need to consult the documentation for the one you have installed (man mpiexec) to find out how to use it.
The basic idea is that inside that file you can define the nodes that you want to use and how many ranks you want on each of them. This may require other flags to specify how processes are mapped to nodes, but in the end you can usually control the layout yourself.
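As an illustration, assuming Open MPI (other MPI implementations use a different hostfile syntax), a hostfile that pins 12 ranks to each of two nodes might look like:

```
node1 slots=12
node2 slots=12
```

You would then launch with something like mpirun --hostfile hosts.txt -np 24 python mycode.py, letting the slots entries spread the ranks across both nodes.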
All of this is different if you're using a scheduler like PBS, TORQUE, LoadLeveler, etc. as those can sometimes do some of this for you or have different ways of mapping jobs themselves. You'll have to consult the documentation for those separately or ask another question about them with the appropriate tags here.
Clusters usually have a batch scheduler like PBS, TORQUE, LoadLeveler, etc. These are generally given a shell script that contains your mpirun command along with environment variables that the scheduler needs. You should ask the administrator of your cluster what the process is for submitting batch MPI jobs.
