Using maximum remote servers - parallel-processing

I'm trying to distribute commands to 100 remote computers, but noticed that the commands are only being sent to 16 of them. My local machine has 16 cores. Why is parallel only using 16 remote computers instead of 100?
parallel --eta --sshloginfile list_of_100_remote_computers.txt < list_of_commands.txt

I believe you need to specify the number of parallel jobs to run.
According to the GNU Parallel man page:
--jobs N
-j N
--max-procs N
-P N
Number of jobslots. Run up to N jobs in parallel. 0 means as many as possible. Default is 100% which will run one job per CPU core.
And keep this in mind:
When you start more than one job with the -j option, it is reasonable
to assume that each job might not take exactly the same amount of time
to complete. If you care about seeing the output in the order that
file names were presented to Parallel (instead of when they
completed), use the --keeporder option.
Parallel Multicore at the Command Line with GNU Parallel, Admin Magazine

If the remote machines have 32 cores each, then you are running 16*32 jobs. By default GNU Parallel uses one file handle each for STDOUT and STDERR of every job, in total 16*32*2 = 1024 file handles.
If you have a default GNU/Linux system you will be hitting the 1024 file handle limit.
If --ungroup runs more jobs, then that is a clear indication that you have hit the file handle limit. Use ulimit -n to increase the limit.
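The arithmetic above can be checked before starting a large run. This is a minimal sketch, assuming every remote host has the same number of cores; handles_needed is a hypothetical helper, not part of GNU Parallel.

```shell
# Hypothetical helper: file handles needed when GNU Parallel groups output,
# i.e. one handle for STDOUT and one for STDERR per simultaneously running job.
handles_needed() {
    hosts=$1
    cores_per_host=$2
    echo $(( hosts * cores_per_host * 2 ))
}

needed=$(handles_needed 16 32)   # the 16 hosts x 32 cores case from above
limit=$(ulimit -n)               # current soft limit on open files
echo "needed=$needed limit=$limit"

# Warn if the estimate reaches the limit ("unlimited" never triggers it).
if [ "$limit" != "unlimited" ] && [ "$needed" -ge "$limit" ]; then
    echo "file handle limit likely to be hit; raise it with ulimit -n"
fi
```

For all 100 hosts at 32 cores each, the same estimate gives 6400 handles.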

Related

Jobs allocate twice the cores that I request on SLURM

I am trying to understand why twice the amount of cores I request are being allocated to my sbatch jobs.
From what I can tell, my partition has 104 threads per node (2 sockets x 26 cores x 2 threads):
[.... snake_make]$ sinfo -p mypartition -o %z
S:C:T
2:26:2
Yet with the sbatch set like so for my snakemake:
module load snakemake/5.6.0
snakemake -s snake_make_tetragonula --cluster-config cluster.yaml --jobs 70 \
  --cluster "sbatch -n 4 -M {cluster.cluster} -A {cluster.account} -p {cluster.partition}" \
  --latency-wait 10
Each job is being allocated 8 cores instead of 4. When I run squeue, I see that it is only able to run as many as 12 jobs at a time, suggesting that it is using 8 cores for each job despite me specifying 4 threads. Also when I look at my job usage on XDMoD, I see that only half of the cpus on the job are getting used. How can I use exactly as many cpus as I want and not double that amount, like it is currently running? I have also tried
--ntasks=1 --cpus-per-task=4
which still doubled it to 8. Thanks.
Slurm can only allocate cores, not threads. So, with such a configuration:
S:C:T
2:26:2
two threads are allocated to jobs for each core being requested. Two hardware threads cannot be allocated to distinct jobs.
You can try with
--ntasks=1 --cpus-per-task=2 --threads-per-core=2
But, if your computation is CPU-intensive, this can make your jobs slower.
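The doubling can be worked out from the S:C:T string. This is a rough sketch, assuming the scheduler hands out whole cores as described above; threads_per_node and threads_allocated are hypothetical helper names, and the real behaviour depends on how the cluster is configured.

```shell
# Split an S:C:T string (sockets:cores:threads, as printed by `sinfo -o %z`)
# and multiply the three fields.
threads_per_node() {
    s=${1%%:*}; rest=${1#*:}
    c=${rest%%:*}; t=${rest#*:}
    echo $(( s * c * t ))
}

# Hardware threads occupied when whole cores are allocated:
# every requested core brings all of its threads with it.
threads_allocated() {
    cores_requested=$1
    threads_per_core=$2
    echo $(( cores_requested * threads_per_core ))
}

threads_per_node 2:26:2      # prints 104
threads_allocated 4 2        # prints 8
```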

Running a queue of MPI calls in parallel with SLURM and limited resources

I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core matlab process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call takes up to 20 minutes.
I initially naively submitted each MPI call as a separate SLURM job, but the resulting queue time made it slower than running each job locally in serial. I am now trying to figure out a way to submit an N node job that will continuously run MPI tasks to utilize the available resources. The matlab process would manage this job with text file flags.
Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt" == 1>
do
for i in {0..40}
do
if <"RunJob_i.txt" == 1>
then
mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
fi
done
done
wait
This approach doesn't work (just crashes), but I don't know why (probably overutilization of resources?). Some of my peers have suggested using parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. This will be a huge waste of resources, as a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem would be starting a batch of 5 8-core jobs and having 4 of them crash immediately; now 32 cores would be doing nothing while they wait up to 20 minutes for the 5th job to finish.
Since the optimization will likely require upwards of 5000 mpi calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice as to how I could run a constant stream of MPI calls on a large SLURM job? I would really appreciate any help.
A couple of things: under SLURM you should be using srun, not mpirun.
The second thing is that the pseudo-code you provided launches an infinite number of jobs without waiting for any completion signal. You should try to put the wait into the inner loop, so you launch just a set of jobs, wait for them to finish, evaluate the condition and, maybe, launch the next set of jobs:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 4 # total number of tasks in this job
#SBATCH -c 8 # number of processor cores for each task
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt" == 1>
do
for i in {0..40}
do
if <"RunJob_i.txt" == 1>
then
srun -n 1 -c 8 --exclusive <job_i> &
fi
done
wait
<Update "KeepRunning.txt">
done
Also take care to distinguish tasks from cores: -n says how many tasks will be used, -c says how many CPUs will be allocated per task.
The code I wrote will launch 41 jobs in the background (0 to 40, inclusive), but each will only start once resources are available (--exclusive), waiting while they are occupied. Each job will use 8 CPUs. You then wait for them all to finish, and I assume that you will update KeepRunning.txt after that round.
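The placeholder conditions in the loop could be made concrete along these lines. This is a sketch only, assuming each flag file holds a single 0 or 1 written by the MATLAB process, that each job is wrapped in a hypothetical ./job_<i>.sh script, and that the srun options shown suit the cluster. LAUNCH is a hypothetical override that lets the loop be tried without Slurm.

```shell
# True when the flag file is readable and contains exactly "1".
flag_is_set() {
    [ -r "$1" ] && [ "$(cat "$1")" = "1" ]
}

# One round: start every flagged job in the background, then wait for all.
# LAUNCH (hypothetical) replaces the srun command, e.g. LAUNCH=echo for a dry run.
run_round() {
    for i in $(seq 0 40); do
        if flag_is_set "RunJob_${i}.txt"; then
            ${LAUNCH:-srun -n 1 -c 8 --exclusive} "./job_${i}.sh" &
        fi
    done
    wait
}

# Outer loop, left commented out because it needs the real flag files
# (the MATLAB side is assumed to reset the RunJob flags between rounds):
# while flag_is_set KeepRunning.txt; do run_round; done
```

A dry run with LAUNCH="echo launched" prints one line per flagged job instead of starting anything, which is a cheap way to check the flag handling before submitting.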

gnu parallel one job per processor

I am trying to use GNU Parallel (version 20160922) to launch a large number of protein docking jobs (using UCSF Dock 6.7). I am running on a high performance cluster with several dozen nodes, each with 28-40 cores. The system is running CentOS 7.1.1503 and uses Torque for job management.
I am trying to submit each config file in dock.in.d to the dock executable, one per core on the cluster. Here is my PBS file:
#PBS -l walltime=01:00:00
#PBS -N pardock
#PBS -l nodes=1:ppn=28
#PBS -j oe
#PBS -o /home/path/to/pardock.log
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE temp.txt
#f=$(pwd)
ls dock.in.d/*.in | parallel -j 300 --sshloginfile $PBS_NODEFILE "/path/to/local/bin/dock6 -i {} -o {}.out"
This works fine on a single node as written above. But when I scale up to, say, 300 processors (with -l procs=300) across several nodes, I begin to get these errors:
parallel: Warning: ssh to node026 only allows for 99 simultaneous logins.
parallel: Warning: You may raise this by changing /etc/ssh/sshd_config:MaxStartups and MaxSessions on node026.
What I do not understand is why there are so many logins. Each node only has 28-40 cores so, as specified in $PBS_NODEFILE, I would expect there to only be 28-40 SSH logins at any point in time on these nodes.
Am I misunderstanding or misexecuting something here? Please advise what other information I can provide or what direction I should go to get this to work.
UPDATE
So my problem above was the combination of -j 300 and the use of $PBS_NODEFILE, which has a separate entry for each core on each node. So in that case it seems I should use -j 1. But then all the jobs seem to run on a single node.
So my question remains: how do I get GNU Parallel to balance the jobs between nodes, utilizing all cores, without creating an excessive number of SSH logins due to multiple jobs per core?
Thank you!
You are asking GNU Parallel to ignore the number of cores and run 300 jobs on each server.
Try instead:
ls dock.in.d/*.in | parallel --sshloginfile $PBS_NODEFILE /path/to/local/bin/dock6 -i {} -o {}.out
This will default to --jobs 100% which is one job per core on all machines.
If you are not allowed to use all cores on the machines, you can prepend X/ to the hosts in --sshloginfile to force X as the number of cores:
28/server1.example.com
20/server2.example.com
16/server3.example.net
This will force GNU Parallel to skip the detection of cores, and instead use 28, 20, and 16 respectively. This combined with -j 100% can control how many jobs you want started on the different servers.
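Since $PBS_NODEFILE lists each host once per allotted core, a file in the X/host form can be derived from it by counting duplicates. This is a minimal sketch; nodefile_to_sshlogin and sshlogins.txt are hypothetical names, and the demo node file is made up to stand in for the real one.

```shell
# Turn a node file with one line per core (as $PBS_NODEFILE has)
# into "cores/host" lines suitable for --sshloginfile.
nodefile_to_sshlogin() {
    sort "$1" | uniq -c | awk '{ print $1 "/" $2 }'
}

# Small made-up node file standing in for $PBS_NODEFILE:
demo=$(mktemp)
printf 'node026\nnode026\nnode026\nnode027\nnode027\n' > "$demo"
nodefile_to_sshlogin "$demo"
# prints:
# 3/node026
# 2/node027
rm -f "$demo"

# On the cluster (not run here):
# nodefile_to_sshlogin "$PBS_NODEFILE" > sshlogins.txt
# ls dock.in.d/*.in | parallel --sshloginfile sshlogins.txt /path/to/local/bin/dock6 -i {} -o {}.out
```

Each host then appears once, so the default of one job per core no longer multiplies with the repeated entries.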

PBS: job on two nodes uses memory of only one

I am trying to run a job (Python code) on a cluster using MPI. There is 63GB of memory available on each node.
When I run it on one node, I specify PBS parameters with (only relevant parameters are listed here):
#PBS -l mem=60GB
#PBS -l nodes=node01.cluster:ppn=32
time mpiexec -n 32 python code.py
That works just fine.
Since the PBS man page says mem is the memory for the entire job, my parameters when trying to run it on two nodes are
#PBS -l mem=120GB
#PBS -l nodes=node01.cluster:ppn=32+node02.cluster:ppn=32
time mpiexec -n 64 python code.py
This doesn't work (qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max mem requirement). It fails even if I set mem=70GB, for example (in case the system needs some more memory).
If I set mem=60GB when trying to use both nodes, I get
=>> PBS: job killed: mem job total xx kb exceeded limit yy kb.
I tried it with pmem as well (that's pmem=1875MB), but no success.
My question is: How can I use the entire 120GB of memory?
Torque / PBS ignores the mem resource unless the job uses a single node. From the documentation:
Maximum amount of physical memory used by the job. (Ignored on Darwin, Digital Unix, Free BSD, HPUX 11, IRIX, NetBSD, and SunOS. Also ignored on Linux if number of nodes is not 1. Not implemented on AIX and HPUX 10.)
You should instead use the pmem resource that limits the memory per job process. With ppn=32 you should set pmem to 1920MB in order to get 60 GB per node. In that case you should mind that pmem does not allow flexible distribution of memory between the processes running on the node the same way mem does (since the latter is accounted as an aggregated value while pmem applies to each process individually).
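The 1920MB figure is just the per-node memory divided by the processes per node. A small sketch of that calculation, assuming 1 GB = 1024 MB; pmem_mb is a hypothetical helper name.

```shell
# pmem per process in MB, given memory per node (GB) and processes per node.
# Uses 1 GB = 1024 MB and integer division (rounds down).
pmem_mb() {
    mem_per_node_gb=$1
    ppn=$2
    echo $(( mem_per_node_gb * 1024 / ppn ))
}

pmem_mb 60 32    # prints 1920

# The result would then go into the job script, e.g.:
# #PBS -l pmem=1920MB
```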

shell script to loop and start processes in parallel?

I need a shell script that will create a loop to start parallel tasks read in from a file...
Something along the lines of:
#!/bin/bash
mylist=/home/mylist.txt
for i in ('ls $mylist')
do
do something like cp -rp $i /destination &
end
wait
So what I am trying to do is send a bunch of tasks to the background with "&", one for each line in $mylist, and wait for them to finish before exiting.
However, there may be a lot of lines in there so I want to control how many parallel background processes get started; want to be able to max it at say.. 5? 10?
Any ideas?
Thank you
Your task manager will make it seem like you can run many parallel jobs. How many you can actually run at maximum efficiency depends on your processor. In general you don't have to worry about starting too many processes, because the system will schedule them for you. If you want to limit them anyway because the number could get absurdly high, you could use something like this (provided you execute a cp command every time):
...
while ...; do
jobs=$(pgrep 'cp' | wc -l)
[[ $jobs -gt 50 ]] && { sleep 100; continue; }
...
done
The number of running cp commands will be stored in the jobs variable and before starting a new iteration it will check if there are too many already. Note that we jump to a new iteration so you'd have to keep track of how many commands you already executed. Alternatively you could use wait.
Edit:
On a side note, you can assign a specific CPU core to a process using taskset; it may come in handy when you have fewer, more complex commands.
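A variant of the same polling idea counts the script's own background jobs instead of every cp process on the machine, so the loop never has to skip an iteration. This is a sketch under assumptions: it relies on bash's jobs builtin, and max_jobs, run_limited and copy_one are hypothetical names.

```shell
#!/bin/bash
# Run "command item" for every line on stdin, keeping at most
# $max_jobs of them running at once.
max_jobs=5

run_limited() {
    while IFS= read -r item; do
        # Block while all slots are busy; `jobs -rp` lists running background jobs.
        while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
            sleep 0.1
        done
        "$@" "$item" &
    done
    wait   # let the last jobs finish
}

# Usage (not run here):
# copy_one() { cp -rp "$1" /destination; }
# run_limited copy_one < /home/mylist.txt
```

Because the inner loop only delays the launch, every line of the list is still processed exactly once.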
You are probably looking for something like this using GNU Parallel:
parallel -j10 cp -rp {} /destination :::: /home/mylist.txt
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
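Where GNU Parallel cannot be installed, xargs with -P refills its slots in a similar way: it starts the next command as soon as a running one finishes. This is a different tool from the one recommended above, shown only as a rough sketch; copy_parallel is a hypothetical helper, and -P is a GNU/BSD extension rather than POSIX.

```shell
# Copy every path listed in a file to a destination directory,
# running at most max_procs copies at a time.
copy_parallel() {
    list=$1
    dest=$2
    max_procs=${3:-10}
    # -I{} takes one input line per cp invocation; as soon as one cp
    # finishes, xargs starts the next, so the slots stay busy.
    # Note: xargs treats quotes and backslashes in the list specially.
    xargs -P "$max_procs" -I{} cp -rp {} "$dest" < "$list"
}

# Usage (not run here):
# copy_parallel /home/mylist.txt /destination 10
```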
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
