Torque does not limit the number of nodes mpiexec uses - parallel-processing

I'm running the following two PBS files at the same time:
qsub /mnt/folder/prueba1_1
qsub /mnt/folder/prueba01
And here are the files:
prueba1_1
#!/bin/bash
#PBS -N pruebaF
#PBS -V
#PBS -l nodes=1:ppn=1
#PBS -q batch
#PBS -j eo
cd /mnt/folder
mpiexec -f machinefile ./cpi2>>salida1_1.o
prueba01
#!/bin/bash
#PBS -N pruebaF
#PBS -V
#PBS -l nodes=1:ppn=1
#PBS -q batch
#PBS -j eo
cd /mnt/folder
mpiexec -f machinefile ./cpi2>>salida01.o
The file machinefile contains 2 nodes, slave02 and slave03, each one with 1 processor.
Although I specify that each PBS file should use just 1 node and 1 processor per job (with #PBS -l nodes=1:ppn=1), the output files seem to show that each job is using both nodes at the same time. I'm wondering why, since each of these PBS files should use just one node and 1 processor. I would expect prueba1_1 to use slave02 with 1 processor and prueba01 to use slave02 as well, but with the other processor.
The output files are here:
salida1_1.o
Process 0 of 2 is on slave02
Process 1 of 2 is on slave03
pi is approximately 3.1415926535900915, Error is 0.0000000000002984
wall clock time = 14.937282
salida01.o
Process 0 of 2 is on slave02
Process 1 of 2 is on slave03
pi is approximately 3.1415926535900915, Error is 0.0000000000002984
wall clock time = 14.741892

I would change machinefile to $PBS_NODEFILE. When Torque/PBS assigns nodes to your job, it creates a file containing a list of those nodes and stores the path to that file in the variable PBS_NODEFILE. I'm guessing machinefile was created for testing. Since it is not created or updated by Torque, mpiexec always reads the same two hosts from it, which is why your jobs always run the same way regardless of the resources requested.
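A minimal sketch of prueba1_1 with that change applied, assuming the same mpiexec that accepts the -f flag used in the question. The DRY_RUN variable and the one-line stand-in nodefile are hypothetical additions so the script can be inspected outside a Torque job; inside a real job, Torque sets PBS_NODEFILE itself.

```shell
#!/bin/bash
#PBS -N pruebaF
#PBS -V
#PBS -l nodes=1:ppn=1
#PBS -q batch
#PBS -j eo

# Outside a Torque job PBS_NODEFILE is unset. Fall back to a one-line
# stand-in file (hypothetical, for dry runs only).
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    echo "slave02" > "$PBS_NODEFILE"
    DRY_RUN=1
fi

# The nodefile has one line per allocated processor slot,
# so its line count is the number of MPI processes to start.
NP=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')

if [ -n "${DRY_RUN:-}" ]; then
    # Only print the command that a real job would execute.
    echo "would run: mpiexec -f $PBS_NODEFILE -n $NP ./cpi2"
    rm -f "$PBS_NODEFILE"
else
    cd /mnt/folder || exit 1
    mpiexec -f "$PBS_NODEFILE" -n "$NP" ./cpi2 >> salida1_1.o
fi
```

With nodes=1:ppn=1 the nodefile should hold a single line, so each job would start one process on the one host Torque assigned to it.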

Related

PBS Scheduling on Allocating One Node

I am trying to request two nodes in a cluster setting; however, when I print ${PBS_NODEFILE}, only one node is visible. I am running this batch script on the login node. Any suggestions as to why I am only seeing one node?
#PBS -S /bin/bash
#PBS -V
#PBS -W block=true
#PBS -l nodes=2:ppn=12
#PBS -l walltime=01:00:00
#PBS -N resnet50
#PBS -A MyProject
echo "The nodefile for this job is stored at ${PBS_NODEFILE}"
cat ${PBS_NODEFILE}

Is there any way to Run codes between multiple Nodes on HPC

I am trying to run, let's say, 10 different codes, each saved in its respective directory, named 1, 2, 3, ..., 10.
#PBS -l nodes=10:cores=1
This means I have 1 thread on each of 10 different CPUs. Now I have to submit a job so that directory 1 gets 1 thread of 1 CPU only, and similarly for the other directories 2, 3, ..., 10.
The codes are for molecular dynamics and run for several hours, and they are independent of each other. I tried GNU Parallel, but I failed to make use of all 10 CPUs. Maybe GNU Parallel is made to distribute jobs among the cores of 1 CPU. I know MPI can do this, but I don't know exactly how. Could anyone please make a suggestion?
I do not have access to a PBS cluster, but Example 2 from
https://www.nas.nasa.gov/hecc/support/kb/using-gnu-parallel-to-package-multiple-jobs-in-a-single-pbs-job_303.html might be what you are looking for:
#PBS -lselect=6:ncpus=4:model=san
#PBS -lwalltime=4:00:00
cd $PBS_O_WORKDIR
seq 64 | parallel -j 4 -u --sshloginfile $PBS_NODEFILE \
"cd $PWD; ./myscript.csh {}"
Adapted to your situation (untested):
#PBS -l place=scatter
#PBS -l nodes=10:cores=1
cd $PBS_O_WORKDIR
seq 10 | parallel -j 1 --sshloginfile $PBS_NODEFILE --wd $PBS_O_WORKDIR ./myscript {}
You need place=scatter because otherwise the same host may be listed twice in $PBS_NODEFILE, and GNU Parallel ignores duplicates.
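To check whether duplicates are an issue for a given allocation, the distinct hosts in the nodefile can be compared with the total number of slots. This is only a sketch: count_unique_hosts is a hypothetical helper, and the demo file with node01/node02 stands in for a real $PBS_NODEFILE.

```shell
# Hypothetical helper: number of distinct hosts listed in a nodefile.
count_unique_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Stand-in nodefile in which one host appears twice.
demo=$(mktemp)
printf 'node01\nnode01\nnode02\n' > "$demo"

unique=$(count_unique_hosts "$demo")
total=$(wc -l < "$demo" | tr -d ' ')
echo "$unique distinct hosts across $total slots"
rm -f "$demo"
```

Inside a job, count_unique_hosts "$PBS_NODEFILE" returning fewer than 10 would suggest that some hosts are listed more than once and that the jobs would not be spread over 10 machines.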

how to run bash script inside the PBS script in the head node, after running the program in compute nodes

Current working script (script-A):
#!/bin/bash
#PBS -N test7
#PBS -q batch
#PBS -l nodes=1:ppn=6,walltime=00:30:00
#PBS -j oe
cd $PBS_O_WORKDIR
mpirun -np 6 /home/sai/1QE/qe-6.5/bin/pw.x < si.scf.in > 92scf.out
What do I want?
I want to run a bash script, analysis.sh, on the head node after the above job has finished running on the compute node,
e.g. script-B:
#!/bin/bash
#PBS -N test7
#PBS -q batch
#PBS -l nodes=1:ppn=6,walltime=00:30:00
#PBS -j oe
cd $PBS_O_WORKDIR
mpirun -np 6 /home/sai/1QE/qe-6.5/bin/pw.x < si.scf.in > 92scf.out
bash analysis.sh
Problem
The above script-B would normally be fine, but not in my case.
My problem is that the analysis program is installed only on my head node, not on the compute node,
so it will work only on the head node.
Is there any way to run the analysis.sh script on the head node after the PBS script has finished on the compute node?

Job crashes with OOM when I submit it to the queue (PBS/Torque) but not when I simply run the command in a terminal

Here's the job script I use:
#!/bin/bash
#PBS -q batch
#PBS -N simulation
#PBS -j n
#PBS -o /dev/null
#PBS -l nodes=1:ppn=1,pmem=3400mb
#PBS -l ncpus=1,mem=3400mb
cd ${PBS_O_WORKDIR} && \
./executable
I get this error:
Operating system error: Cannot allocate memory
Allocation would exceed memory limit
I've tried increasing ppn, ncpus and mem so that they match the requirements of my program (~6GB). That doesn't help either.
This doesn't happen when I run the command in a terminal. There it works just fine.
Try running the program with all of the memory in the system allocated to the job.
Then, at the very end of the job script, execute qstat -f $PBS_JOBID | grep resources to see what Torque reports as the used resources.
resources_used.cput = 00:00:00
resources_used.energy_used = 0
resources_used.mem = 94724kb
resources_used.vmem = 1564024kb
resources_used.walltime = 00:03:25
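As a rough sketch, the same check can be wrapped in a small filter that prints the requested limits next to the measured usage. show_resources is a hypothetical helper, and the sample text is an illustrative excerpt in the style of qstat -f output so that the sketch runs without a Torque server.

```shell
# Hypothetical helper: keep the requested limits (Resource_List.*) and the
# measured usage (resources_used.*) from `qstat -f` output read on stdin.
show_resources() {
    grep -E 'resources_used|Resource_List' | sed 's/^[[:space:]]*//'
}

# Illustrative excerpt, not real output from this job.
sample='    Job_Name = simulation
    resources_used.mem = 94724kb
    resources_used.vmem = 1564024kb
    Resource_List.mem = 3400mb'

report=$(printf '%s\n' "$sample" | show_resources)
lines=$(printf '%s\n' "$report" | wc -l | tr -d ' ')
echo "$report"
```

At the end of a real job script the call would be qstat -f "$PBS_JOBID" | show_resources, which makes it easier to see whether vmem is approaching a limit that the interactive terminal session does not have.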

Virtual memory allocation in cluster - line command

I'm running a code on a computer cluster with 24 nodes, each with 12 processors and about 64 GB of memory. The commands I'm using to launch it are the following:
#!/bin/sh
#PBS -N cclit
#PBS -l walltime=288:00:00
#PBS -l nodes=1:ppn=1
#PBS -j oe
#PBS -m n
#PBS -l mem=60000mb
Unfortunately, I realized that my code needs at least 120000mb of virtual memory. I tried modifying the above commands as follows:
#!/bin/sh
#PBS -N cclit
#PBS -l walltime=288:00:00
#PBS -l nodes=2:ppn=2
#PBS -j oe
#PBS -m n
#PBS -l mem=120000mb
But it doesn't seem to work. It stops again at the same point, telling me that virtual memory is not sufficient.
My code is not parallelized, meaning that only 1 processor is needed. What happens when the memory of a node is totally used? I guess I'm doing something wrong with '#PBS -l mem=120000mb', or I probably need some other command. I tried to look for a solution on the web but I didn't find anything.
Can you help me?
Thanks Mirko.
