MPI not using all CPUs allocated - multiprocessing

I am trying to run some code across multiple CPUs using MPI.
I run using:
$ mpirun -np 24 python mycode.py
I'm running on a cluster with 8 nodes, each with 12 CPUs. My 24 processes get scattered across all nodes.
Let's call the nodes node1, node2, ..., node8 and assume that the master process is on node1 and my job is the only one running. So node1 has the master process and a few slave processes, the rest of the nodes have only slave processes.
Only the node with the master process (i.e. node1) is being used. I can tell because node2 through node8 have load ~0 and node1 has load ~24 (whereas I would expect the load on each node to be approximately equal to the number of CPUs allocated to my job from that node). Also, each time a function is evaluated, I get it to print out the name of the host on which it's running, and it prints out "node1" every time. I don't know whether the master process is the only one doing anything or if the slave processes on the same node as the master are also being used.
The cluster I'm running on was recently upgraded. Before the upgrade, I was using the same code and it behaved entirely as expected (i.e. when I asked for 24 CPUs, it gave me 24 CPUs and then used all 24 CPUs). This problem has only arisen since the upgrade, so I assume a setting somewhere got changed or reset. Has anyone seen this problem before and know how I might fix it?
Edit: This is submitted as a job to a scheduler using:
#!/bin/bash
#
#$ -cwd
#$ -pe * 24
#$ -o $JOB_ID.out
#$ -e $JOB_ID.err
#$ -r no
#$ -m n
#$ -l h_rt=24:00:00
echo job_id $JOB_ID
echo hostname $HOSTNAME
mpirun -np $NSLOTS python mycode.py
The cluster is running SGE and I submit this job using:
qsub myjob

It's also possible to specify where you want your jobs to run by using a hostfile. How the hostfile is formatted and used varies by MPI implementation so you'll need to consult the documentation for the one you have installed (man mpiexec) to find out how to use it.
The basic idea is that inside that file, you can define the nodes that you want to use and how many ranks you want on those nodes. This may require using other flags to specify how the processes are mapped to your nodes, but in the end, you can usually control how everything is laid out yourself.
All of this is different if you're using a scheduler like PBS, TORQUE, LoadLeveler, etc. as those can sometimes do some of this for you or have different ways of mapping jobs themselves. You'll have to consult the documentation for those separately or ask another question about them with the appropriate tags here.
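As a rough sketch of the hostfile idea described above, the snippet below renders a node-to-rank-count mapping in two common layouts: "host slots=N" (the form Open MPI reads) and "host:N" (the form MPICH's Hydra launcher reads). The node names and counts are made up, the function name render_hostfile is hypothetical, and the exact format should be confirmed against your own man mpiexec before relying on it.

```python
def render_hostfile(ranks_per_node, style="openmpi"):
    """Return hostfile text for a {hostname: rank_count} mapping.

    style="openmpi" writes "host slots=N"; style="mpich" writes "host:N".
    Both layouts are assumptions about the installed MPI; check its docs.
    """
    lines = []
    for host, count in ranks_per_node.items():
        if count < 1:
            raise ValueError(f"{host}: rank count must be positive")
        if style == "openmpi":
            lines.append(f"{host} slots={count}")
        elif style == "mpich":
            lines.append(f"{host}:{count}")
        else:
            raise ValueError(f"unknown style: {style}")
    return "\n".join(lines) + "\n"


# Hypothetical layout: 24 ranks spread evenly over the 8 nodes in the question.
layout = {f"node{i}": 3 for i in range(1, 9)}
hostfile_text = render_hostfile(layout)
print(hostfile_text, end="")
```

The resulting file would typically be passed with something like mpirun --hostfile myhosts -np 24 for Open MPI or mpiexec -f myhosts -n 24 for MPICH, but check the flag names for your installation.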

Clusters usually have a batch scheduler like PBS, TORQUE, LoadLeveler, etc. These are generally given a shell script that contains your mpirun command along with environment variables that the scheduler needs. You should ask the administrator of your cluster what the process is for submitting batch MPI jobs.

Related

PBS torque: how to solve cores waste problem in parallel tasks that spend very different time from each other?

I'm running parallel MATLAB or Python tasks on a cluster managed by PBS TORQUE. The awkward situation is that PBS thinks I'm using 56 cores, but that was only true at the start; by now only the 7 hardest tasks are still running, so 49 cores are wasted.
My parallel tasks take very different amounts of time because they search over different model parameters, and I can't know how long a task will take before I have tried it. At the start all cores were used, but soon only the hardest tasks were left running. Since the job as a whole had not finished, PBS TORQUE still considered all 56 cores in use and prevented new tasks from running, even though most cores were idle. I want PBS to detect this and use the idle cores to run new tasks.
So my question is: are there settings in PBS TORQUE that can automatically detect how many cores a job is really using and allocate the idle cores to new tasks?
#PBS -S /bin/sh
#PBS -N alps_task
#PBS -o stdout
#PBS -e stderr
#PBS -l nodes=1:ppn=56
#PBS -q batch
#PBS -l walltime=1000:00:00
#HPC -x local
cd /tmp/$PBS_O_WORKDIR
alpspython spin_half_correlation.py 2>&1 > tasklog.log
A short answer to your question is No: PBS has no way to reclaim unused resources allocated to a job.
Since your computation is essentially a bunch of independent tasks, what you could and probably should do is split your job into 56 independent jobs, each running an individual combination of model parameters. When all the jobs are finished, you could run an additional job to collect and summarize the results. This is a well-supported way of doing things. PBS has some useful features for this type of job, such as array jobs and job dependencies.
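A minimal sketch of how each sub-job of an array job might pick its own parameter combination, assuming TORQUE, which exposes the sub-job number in the PBS_ARRAYID environment variable. The parameter names and values in GRID and the helper params_for_index are hypothetical placeholders, not taken from the original script.

```python
import itertools
import os

# Hypothetical parameter grid; replace with the real model parameters.
GRID = {
    "coupling": [0.5, 1.0, 1.5, 2.0],
    "temperature": [0.1, 0.2],
}


def params_for_index(index, grid=GRID):
    """Map a 1-based array index to one combination from the grid."""
    names = list(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    if not 1 <= index <= len(combos):
        raise IndexError(f"index {index} outside 1..{len(combos)}")
    return dict(zip(names, combos[index - 1]))


if __name__ == "__main__":
    # TORQUE sets PBS_ARRAYID for each sub-job of an array job; other PBS
    # variants use a different variable name (assumption: check your site).
    index = int(os.environ.get("PBS_ARRAYID", "1"))
    print(params_for_index(index))
```

Submitted as an array job (for example with qsub -t 1-8 on TORQUE), each sub-job would request a single core and release it as soon as its own search finishes, so the scheduler can hand that core to other work.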

sun grid engine qsub to all nodes

I have a master and two nodes. They have SGE installed, and I have a shell script ready on all the nodes as well. Now I want to use qsub to submit the job on all my nodes.
I used:
qsub -V -b n -cwd /root/remotescript.sh
but it seems that only one node is doing the job. How do I submit the job to all nodes? What would the command be?
SGE is meant to dispatch jobs to worker nodes. In your example, you create one job, so one node will run it. If you want to run a job on each of your nodes, you need to submit more than one job. If you want to target specific nodes, you should probably use something closer to
qsub -V -b n -cwd -l hostname=node001 /root/remotescript.sh
qsub -V -b n -cwd -l hostname=node002 /root/remotescript.sh
The "-l hostname=*" parameter will require a specific host to run the job.
What are you trying to do? The general use case for a grid engine is to let the scheduler dispatch the jobs, so you don't have to use the "-l hostname=*" parameter. So technically you should just submit a bunch of jobs to SGE and let it dispatch them according to node availability.
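If you do want one pinned job per node, the two qsub lines above generalize to a loop. The sketch below only builds and prints the command lines so that it runs without SGE installed; the helper name qsub_commands is hypothetical and the host names are the ones from the example.

```python
import shlex


def qsub_commands(hosts, script="/root/remotescript.sh"):
    """Build one qsub argument list per host, pinning each job to that host."""
    base = ["qsub", "-V", "-b", "n", "-cwd"]
    return [base + ["-l", f"hostname={host}", script] for host in hosts]


# Host names taken from the example above; adjust to your cluster.
commands = qsub_commands(["node001", "node002"])
for argv in commands:
    # Printed rather than executed, so the sketch runs without SGE installed.
    # To submit for real, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```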
Finch_Powers' answer is good for describing how SGE allocates resources. So I'll elaborate below on the specifics of your question, which may be why you are not getting the desired outcome.
You mention launching remote script via:
qsub -V -b n -cwd /root/remotescript.sh
Also, you mention again that these scripts are located on the nodes:
"And I have a shell script ready on all the nodes as well"
This is not how SGE is designed to work, although it can do this. Typical usage is to make the same script (or scripts) accessible to all execution nodes via network-mounted storage and let SGE decide which nodes to run the script on.
To run remote code, you may be better served using plain SSH.

PBS: job on two nodes uses memory of only one

I am trying to run a job (python code) on cluster using MPI. There is 63GB of memory available on each node.
When I run it on one node, I specify PBS parameters with (only relevant parameters are listed here):
#PBS -l mem=60GB
#PBS -l nodes=node01.cluster:ppn=32
time mpiexec -n 32 python code.py
That works just fine.
Since PBS man page says mem is memory per entire job, my parameters when trying to run it on two nodes, are
#PBS -l mem=120GB
#PBS -l nodes=node01.cluster:ppn=32+node02.cluster:ppn=32
time mpiexec -n 64 python code.py
This doesn't work (qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max mem requirement). It fails even if I set mem=70GB for example (in case system needs some more memory).
If I set mem=60GB when trying to use both nodes, I get
=>> PBS: job killed: mem job total xx kb exceeded limit yy kb.
I tried it with pmem as well (that's pmem=1875MB), but no success.
My question is: How can I use entire 120GB of memory?
Torque / PBS ignores the mem resource unless the job uses a single node. The documentation describes mem as follows:
Maximum amount of physical memory used by the job. (Ignored on Darwin, Digital Unix, Free BSD, HPUX 11, IRIX, NetBSD, and SunOS. Also ignored on Linux if number of nodes is not 1. Not implemented on AIX and HPUX 10.)
You should instead use the pmem resource, which limits the memory per job process. With ppn=32 you should set pmem to 1920MB in order to get 60 GB per node. Keep in mind that pmem does not allow flexible distribution of memory between the processes running on a node the way mem does, since mem is accounted as an aggregate value while pmem applies to each process individually.
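The 1920MB figure is just the per-node budget divided by the number of processes per node, taking 1 GB as 1024 MB. A small sketch of that arithmetic, with pmem_mb as a hypothetical helper name:

```python
def pmem_mb(mem_per_node_gb, ppn):
    """Per-process memory limit in MB, rounded down, using 1 GB = 1024 MB."""
    if ppn < 1:
        raise ValueError("ppn must be at least 1")
    return (mem_per_node_gb * 1024) // ppn


per_process = pmem_mb(60, 32)
print(f"#PBS -l pmem={per_process}MB")  # 60 GB per node across 32 processes
```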

Slurm: What is the difference for code executing under salloc vs srun

I'm using a cluster managed by slurm to run some yarn/hadoop benchmarks. To do this I am starting the hadoop servers on nodes allocated by slurm and then running the benchmarks on them. I realize that this is not the intended way to run a production hadoop cluster, but needs must.
To do this I started by writing a script that runs with srun, e.g. srun -N 4 setup.sh. This script writes the configuration files and starts the servers on the allocated nodes, with the lowest numbered machine acting as the master. This all works, and I am able to run applications.
However, I would like to start the servers once and then launch multiple applications on them without restarting or encoding everything at the beginning, so I would like to use salloc instead. I had thought that this would be a simple case of running salloc -N 4 and then running srun setup.sh. Unfortunately this does not work, as the different servers are unable to communicate with each other. Could anyone explain to me what the difference in the operating environment is between using srun and using salloc then srun?
Many thanks
Daniel
From the slurm-users mailing list:
sbatch and salloc allocate resources to the job, while srun launches parallel tasks across those resources. When invoked within a job allocation, srun will launch parallel tasks across some or all of the allocated resources. In that case, srun inherits by default the pertinent options of the sbatch or salloc which it runs under. You can then (usually) provide srun different options which will override what it receives by default. Each invocation of srun within a job is known as a job step.
srun can also be invoked outside of a job allocation. In that case, srun requests resources, and when those resources are granted, launches tasks across those resources as a single job and job step.

Run a job on all nodes of Sun Grid Engine cluster, only once

I want to run a job on all the active nodes of a 64 node Sun Grid Engine Cluster, scheduled using qsub. I am currently using array-job variable for the same, but sometimes the program is scheduled multiple times on the same node.
qsub -t 1-64:1 -S /home/user/.local/bin/bash program.sh
Is it possible to schedule only one job per node, on all nodes parallely?
You could use a parallel environment. Create a parallel environment with :
qconf -ap "parallel_environment_name"
and set "allocation_rule" to 1, which means that all processes will have to reside on different hosts. Then, when submitting your array job, specify the number of nodes you want to use with your parallel environment. In your case:
qsub -t 1-64:1 -pe "parallel_environment_name" 64 -S /home/user/.local/bin/bash program.sh
For more information, check these links: http://linux.die.net/man/5/sge_pe and Configuring a new parallel environment at DanT's Grid Blog (link no longer working; there are copies on the wayback machine and softpanorama).
If you have a bash terminal, you can run
for host in $(qhost | tail -n +4 | cut -d " " -f 1); do qsub -l hostname=$host program.sh; done
"-l hostname=" specifies on which host to run the job.
The for loop iterates over the host names returned by qhost (skipping its header lines) and submits one job per host, specifying that host with "-l hostname=".
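For reference, the host-extraction step of that loop could be sketched in Python as below. It assumes the first three lines of qhost output are a column header, a dashed separator and the "global" pseudo-host, which is what the tail -n +4 in the shell version skips. The sample text is made up for illustration, and hosts_from_qhost is a hypothetical helper.

```python
def hosts_from_qhost(qhost_output):
    """Return host names from qhost text, skipping the first three lines.

    Those lines are assumed to be the column header, a dashed separator and
    the "global" pseudo-host, matching the `tail -n +4` in the shell loop.
    """
    hosts = []
    for line in qhost_output.splitlines()[3:]:
        fields = line.split()
        if fields:  # ignore blank lines
            hosts.append(fields[0])
    return hosts


# Made-up sample in the shape qhost is assumed to print.
SAMPLE = """\
HOSTNAME     ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
--------------------------------------------------------------------
global       -              -     -       -       -       -       -
node001      lx24-amd64     8  0.01   15.7G    1.1G    2.0G     0.0
node002      lx24-amd64     8  0.00   15.7G    1.0G    2.0G     0.0
"""

print(hosts_from_qhost(SAMPLE))
```

On a real cluster the text would come from running qhost, for example via subprocess.run(["qhost"], capture_output=True, text=True).stdout.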

Resources