PBS Torque: how to avoid wasting cores when parallel tasks take very different amounts of time?

I'm running parallel MATLAB or Python tasks on a cluster managed by PBS Torque. The awkward situation is that PBS thinks I'm using 56 cores, but that was only true at the start; by now only the 7 hardest tasks are still running and 49 cores sit idle.
My parallel tasks take very different amounts of time because they search different model parameters, and I cannot know in advance how long each one will take. At the beginning all cores were busy, but soon only the hardest tasks remained. Because the job as a whole had not finished, PBS Torque still believed I was using all 56 cores and blocked new jobs from starting, even though most cores were actually idle. I would like PBS to detect this and give the idle cores to new tasks.
So my question is: are there settings in PBS Torque that can automatically detect how many cores a job is really using and allocate the genuinely idle cores to new jobs? My submission script:
#PBS -S /bin/sh
#PBS -N alps_task
#PBS -o stdout
#PBS -e stderr
#PBS -l nodes=1:ppn=56
#PBS -q batch
#PBS -l walltime=1000:00:00
#HPC -x local
cd /tmp/$PBS_O_WORKDIR
alpspython spin_half_correlation.py 2>&1 > tasklog.log

A short answer to your question is no: PBS has no way to reclaim unused resources allocated to a running job.
Since your computation is essentially a bunch of independent tasks, what you could, and probably should, do is split your job into 56 independent jobs, each running an individual combination of model parameters, and then run one additional job that collects and summarizes the results once all of them have finished. This is a well-supported way of doing things: PBS has useful features for this type of job, such as array jobs and job dependencies.
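For example, here is a minimal sketch of an array-job version of your script (assuming Torque's -t array syntax; the --param-index flag is a hypothetical argument your Python script would need to accept in order to pick one parameter combination by index):
#PBS -S /bin/sh
#PBS -N alps_sweep
#PBS -o stdout
#PBS -e stderr
#PBS -l nodes=1:ppn=1
#PBS -q batch
#PBS -l walltime=1000:00:00
#PBS -t 0-55
cd $PBS_O_WORKDIR
# each sub-job receives its own index in $PBS_ARRAYID and handles one parameter set
alpspython spin_half_correlation.py --param-index $PBS_ARRAYID > task_$PBS_ARRAYID.log 2>&1
Each of the 56 sub-jobs releases its core the moment it finishes, so no core sits idle waiting for the slowest task. The summary job can then be chained with a dependency, for example qsub -W depend=afterokarray:<array_job_id> summarize.pbs (check the dependency syntax supported by your Torque version).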

Related

Jobs allocate twice the cores that I request on SLURM

I am trying to understand why twice the amount of cores I request are being allocated to my sbatch jobs.
From what I can tell, my partition has 104 threads:
[.... snake_make]$ sinfo -p mypartition -o %z
S:C:T
2:26:2
Yet with the sbatch set like so for my snakemake:
module load snakemake/5.6.0
snakemake -s snake_make_tetragonula --cluster-config cluster.yaml --jobs 70 \
    --cluster "sbatch -n 4 -M {cluster.cluster} -A {cluster.account} -p {cluster.partition}" \
    --latency-wait 10
Each job is being allocated 8 cores instead of 4. When I run squeue, I see that it can only run as many as 12 jobs at a time, suggesting it is using 8 cores per job despite my specifying 4 threads. Also, when I look at my job usage on XDMoD, I see that only half of the CPUs in each job are actually used. How can I get exactly as many CPUs as I request, instead of double that amount? I have also tried
--ntasks=1 --cpus-per-task=4
which still doubled it to 8. Thanks.
Slurm can only allocate cores, not threads. So, with such a configuration:
S:C:T
2:26:2
two threads are allocated to jobs for each core being requested. Two hardware threads cannot be allocated to distinct jobs.
You can try with
--ntasks=1 --cpus-per-task=2 --threads-per-core=2
But, if your computation is CPU-intensive, this can make your jobs slower.
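A minimal sketch of the same request written as batch-script directives (the srun line and my_program are placeholders, not taken from the original question):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2      # two physical cores
#SBATCH --threads-per-core=2   # use up to both hardware threads of each allocated core

srun my_program                # placeholder for the actual workload
With the snakemake setup above, the same options could instead be appended to the sbatch string passed via --cluster.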

Running a queue of MPI calls in parallel with SLURM and limited resources

I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core matlab process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call takes up to 20 minutes.
I initially naively submitted each MPI call as a separate SLURM job, but the resulting queue time made it slower than running each job locally in serial. I am now trying to figure out a way to submit an N node job that will continuously run MPI tasks to utilize the available resources. The matlab process would manage this job with text file flags.
Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt” == 1>
do
for i in {0..40}
do
if <“RunJob_i.txt” == 1>
then
mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
fi
done
done
wait
This approach doesn't work (just crashes), but I don't know why (probably overutilization of resources?). Some of my peers have suggested using parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. This will be a huge waste of resources, as a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem would be starting a batch of 5 8-core jobs and having 4 of them crash immediately; now 32 cores would be doing nothing while they wait up to 20 minutes for the 5th job to finish.
Since the optimization will likely require upwards of 5000 mpi calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice as to how I could run a constant stream of MPI calls on a large SLURM job? I would really appreciate any help.
A couple of things: under SLURM you should be using srun, not mpirun.
The second thing is that the pseudo-code you provided launches an infinite number of jobs without waiting for any completion signal. You should move the wait inside the while loop, so that you launch one set of jobs, wait for them all to finish, re-evaluate the condition and, if appropriate, launch the next set:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 4 # total number of tasks in this job
#SBATCH -c 8 # number of processor cores (CPUs) allocated to each task
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt” == 1>
do
for i in {0..40}
do
if <“RunJob_i.txt” == 1>
then
srun -np 8 --exclusive <job_i> &
fi
done
wait
<Update "KeepRunning.txt”>
done
Take care also to distinguish tasks from cores: -n says how many tasks will be used, -c says how many CPUs per task will be allocated.
The code above launches 41 jobs in the background (i from 0 to 40, inclusive), but they only start once the resources are available (--exclusive), waiting while the resources are occupied. Each job will use 8 CPUs. You then wait for them all to finish, and I assume you update KeepRunning.txt after that round.

gnu parallel one job per processor

I am trying to use GNU parallel (version 20160922) to launch a large number of protein docking jobs (using UCSF Dock 6.7). I am running on a high-performance cluster with several dozen nodes, each with 28-40 cores. The system is running CentOS 7.1.1503 and uses Torque for job management.
I am trying to submit each config file in dock.in.d to the dock executable, one per core on the cluster. Here is my PBS file:
#PBS -l walltime=01:00:00
#PBS -N pardock
#PBS -l nodes=1:ppn=28
#PBS -j oe
#PBS -o /home/path/to/pardock.log
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > temp.txt
#f=$(pwd)
ls dock.in.d/*.in | parallel -j 300 --sshloginfile $PBS_NODEFILE "/path/to/local/bin/dock6 -i {} -o {}.out"
This works fine on a single node as written above. But when I scale up to, say, 300 processors (with -l procs=300) across several nodes, I begin to get these errors:
parallel: Warning: ssh to node026 only allows for 99 simultaneous logins.
parallel: Warning: You may raise this by changing /etc/ssh/sshd_config:MaxStartups and MaxSessions on node026.
What I do not understand is why there are so many logins. Each node only has 28-40 cores so, as specified in $PBS_NODEFILE, I would expect there to only be 28-40 SSH logins at any point in time on these nodes.
Am I misunderstanding or misexecuting something here? Please advise what other information I can provide or what direction I should go to get this to work.
UPDATE
So my problem above was the combination of -j 300 and the use of $PBS_NODEFILE, which has a separate entry for each core on each node. In that case it seems I should have used -j 1. But then all the jobs seem to run on a single node.
So my question remains, how to get gnu parallel to balance the jobs between nodes, utilizing all cores, but not creating an excessive number of SSH logins due to multiple jobs per core.
Thank you!
You are asking GNU Parallel to ignore the number of cores and run 300 jobs on each server.
Try instead:
ls dock.in.d/*.in | parallel --sshloginfile $PBS_NODEFILE /path/to/local/bin/dock6 -i {} -o {}.out
This will default to --jobs 100% which is one job per core on all machines.
If you are not allowed to use all cores on the machines, you can prepend X/ to the hosts in --sshloginfile to force X as the number of cores:
28/server1.example.com
20/server2.example.com
16/server3.example.net
This will force GNU Parallel to skip the detection of cores, and instead use 28, 20, and 16 respectively. This combined with -j 100% can control how many jobs you want started on the different servers.
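For example, here is a minimal sketch of building such a file from $PBS_NODEFILE (which lists one line per allocated core) before invoking GNU Parallel; the file name sshlogins.txt is only an illustration:
# collapse the per-core host list into "count/host" entries
sort $PBS_NODEFILE | uniq -c | awk '{print $1 "/" $2}' > sshlogins.txt
ls dock.in.d/*.in | parallel --sshloginfile sshlogins.txt "/path/to/local/bin/dock6 -i {} -o {}.out"
This keeps the number of simultaneous jobs, and hence SSH sessions, on each node close to its core count instead of hitting the MaxStartups limit.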

Slurm: What is the difference for code executing under salloc vs srun

I'm using a cluster managed by slurm to run some yarn/hadoop benchmarks. To do this I am starting the hadoop servers on nodes allocated by slurm and then running the benchmarks on them. I realize that this is not the intended way to run a production hadoop cluster, but needs must.
To do this I started by writing a script that runs with srun eg srun -N 4 setup.sh. This script writes the configuration files and starts the servers on the allocated nodes, with the lowest numbered machine acting as the master. This all works, and I am able to run applications.
However, as I would like to start the servers once and then launch multiple applications on them without restarting the servers or hard-coding everything at the beginning, I would like to use salloc instead. I had thought this would be a simple case of running salloc -N 4 and then running srun setup.sh. Unfortunately this does not work, as the different servers are unable to communicate with each other. Could anyone explain what the difference in the operating environment is between using srun on its own and using salloc followed by srun?
Many thanks
Daniel
From the slurm-users mailing list:
sbatch and salloc allocate resources to the job, while srun launches parallel tasks across those resources. When invoked within a job allocation, srun will launch parallel tasks across some or all of the allocated resources. In that case, srun inherits by default the pertinent options of the sbatch or salloc which it runs under. You can then (usually) provide srun different options which will override what it receives by default. Each invocation of srun within a job is known as a job step.
srun can also be invoked outside of a job allocation. In that case, srun requests resources, and when those resources are granted, launches tasks across those resources as a single job and job step.
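A minimal sketch of the two patterns side by side (run_bench.sh is a placeholder; exact defaults depend on your site's Slurm configuration):
# pattern 1: standalone srun allocates 4 nodes itself and runs setup.sh as a single job and job step
srun -N 4 setup.sh

# pattern 2: salloc holds the allocation; each srun inside it is a job step
salloc -N 4 bash          # opens a shell inside a 4-node allocation
srun setup.sh             # job step inheriting the allocation's options (all 4 nodes)
srun -N 1 run_bench.sh    # a later job step overriding the defaults
exit                      # leaving the shell releases the allocation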

MPI not using all CPUs allocated

I am trying to run some code across multiple CPUs using MPI.
I run using:
$ mpirun -np 24 python mycode.py
I'm running on a cluster with 8 nodes, each with 12 CPUs. My 24 processes get scattered across all nodes.
Let's call the nodes node1, node2, ..., node8 and assume that the master process is on node1 and my job is the only one running. So node1 has the master process and a few slave processes, the rest of the nodes have only slave processes.
Only the node with the master process (ie node1) is being used. I can tell because nodes2-8 have load ~0 and node1 has load ~24 (whereas I would expect the load on each node to be approximately equal to the number of CPUs allocated to my job from that node). Also, each time a function is evaluated, I get it to print out the name of the host on which its running, and it prints out "node1" every time. I don't know whether the master process is the only one doing anything or if the slave processes on the same node as the master are also being used.
The cluster I'm running on was recently upgraded. Before the upgrade, I was using the same code and it behaved entirely as expected (i.e. when I asked for 24 CPUs, it gave me 24 CPUs and then used all 24 CPUs). This problem has only arisen since the upgrade, so I assume a setting somewhere got changed or reset. Has anyone seen this problem before and know how I might fix it?
Edit: This is submitted as a job to a scheduler using:
#!/bin/bash
#
#$ -cwd
#$ -pe * 24
#$ -o $JOB_ID.out
#$ -e $JOB_ID.err
#$ -r no
#$ -m n
#$ -l h_rt=24:00:00
echo job_id $JOB_ID
echo hostname $HOSTNAME
mpirun -np $NSLOTS python mycode.py
The cluster is running SGE and I submit this job using:
qsub myjob
It's also possible to specify where you want your jobs to run by using a hostfile. How the hostfile is formatted and used varies by MPI implementation, so you'll need to consult the documentation for the one you have installed (man mpiexec) to find out how to use it.
The basic idea is that inside that file you can define the nodes you want to use and how many ranks you want on each of them. This may require other flags to specify how the processes are mapped to your nodes, but in the end you can usually control exactly how everything is laid out.
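For instance, a minimal sketch assuming Open MPI's hostfile syntax (node names and slot counts mirror the 8-node, 12-CPU description above):
# hosts.txt -- one entry per node, slots = number of ranks to place on that node
node1 slots=12
node2 slots=12
# ... node3 through node8 follow the same pattern
Then launch against that file:
mpirun -np 24 --hostfile hosts.txt python mycode.py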
All of this is different if you're using a scheduler like PBS, TORQUE, LoadLeveler, etc. as those can sometimes do some of this for you or have different ways of mapping jobs themselves. You'll have to consult the documentation for those separately or ask another question about them with the appropriate tags here.
Clusters usually have a batch scheduler like PBS, TORQUE, LoadLeveler, etc. These are generally given a shell script that contains your mpirun command along with environment variables that the scheduler needs. You should ask the administrator of your cluster what the process is for submitting batch MPI jobs.
