GNU Parallel with processes that further fork - parallel-processing

Consider the file Processes.txt:
./MyProcess 1 -nbThreads 2
./MyProcess 2 -nbThreads 2
./MyProcess 3 -nbThreads 2
Each MyProcess will attempt to use two cores. Now consider running:
parallel -j 3 :::: Processes.txt
The call to parallel specifically indicates that no more than 3 cores should be used. Will parallel allow MyProcess to fork further, so that the whole thing uses 6 cores, or will it somehow restrict the three MyProcess processes to one core each?

It will run three processes at once and if they choose to create further processes it will neither know nor care.
(Hattip to: Mark Setchell)
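Since parallel only counts the jobs it starts itself, one way to stay inside a core budget is to divide that budget by the number of threads each job uses. The following is a minimal sketch, assuming a hypothetical budget of 6 cores (total_cores and threads_per_job are illustrative names, not parallel options); it only prints the command it would run:

```shell
# Hypothetical core budget and per-job thread count (assumptions for illustration).
total_cores=6
threads_per_job=2   # matches "-nbThreads 2" in Processes.txt

# Integer division gives the number of simultaneous jobs that fits the budget.
jobs=$(( total_cores / threads_per_job ))

# Print the command instead of running it, so the sketch works without GNU Parallel.
echo "parallel -j $jobs :::: Processes.txt"
```

With a budget of 3 cores the same division gives 1, so only one two-threaded MyProcess would run at a time.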

Related

Problems with Orca and OpenMPI for parallel jobs

Hello to the community:
I recently started to use the ORCA software for some quantum calculations, but I have been having a lot of problems launching a parallel calculation on my university's cluster.
To install Orca I used the static version:
orca_4_2_1_linux_x86-64_openmpi314.tar.xz.
I put it in a shared directory of the cluster (/data/shared/opt/ORCA/)
and added this to my ~/.bash_profile:
export PATH="/data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314:$PATH"
export LD_LIBRARY_PATH="/data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314:$LD_LIBRARY_PATH"
To install the corresponding OpenMPI version (3.1.4):
tar -xvf openmpi-3.1.4.tar.gz
cd openmpi-3.1.4
./configure --prefix="/data/shared/opt/ORCA/openmpi314/"
make -j 10
make install
When I use the frontend server everything works fine,
with a .sh like this:
#! /bin/bash
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
$(which orca) test.inp > test.out
and an input like this:
# Computation of myjob at b3lyp/6-31+G(d,p)
%pal nprocs 10 end
%maxcore 8192
! RKS B3LYP 6-31+G(d,p)
! TightSCF Grid5 NoFinalGrid
! Opt
! Freq
%cpcm
smd true
SMDsolvent "water"
end
* xyz 0 1
C 0 0 0
O 0 0 1.5
*
The problem appears when I use the nodes:
.inp file:
#! Computation at RKS B3LYP/6-31+G(d,p) for cis1_bh267_m_Cell_152
%pal nprocs 12 end
%maxcore 8192
! RKS B3LYP 6-31+G(d,p)
! TightSCF Grid5 NoFinalGrid
! Opt
! Freq
%cpcm
smd true
SMDsolvent "water"
end
* xyz 0 1
C -4.38728130 0.21799058 0.17853303
C -3.02072869 0.82609890 -0.29733316
F -2.96869122 2.10937041 0.07179384
F -3.01136328 0.87651596 -1.63230798
C -1.82118365 0.05327804 0.23420220
O -2.26240947 -0.92805650 1.01540713
C -0.53557484 0.33394113 -0.05236121
C 0.54692198 -0.46942807 0.50027196
O 0.31128292 -1.43114232 1.22440290
C 1.93990391 -0.12927675 0.16510948
C 2.87355011 -1.15536140 -0.00858832
C 4.18738231 -0.82592189 -0.32880964
C 4.53045856 0.52514329 -0.45102225
N 3.63662927 1.52101319 -0.26705841
C 2.36381718 1.20228695 0.03146190
F -4.51788749 0.24084604 1.49796862
F -4.53935644 -1.04617745 -0.19111502
F -5.43718443 0.87033190 -0.30564680
H -1.46980819 -1.48461498 1.39034280
H -0.26291843 1.15748249 -0.71875720
H 2.57132559 -2.20300864 0.10283592
H 4.93858460 -1.60267627 -0.48060140
H 5.55483009 0.83859415 -0.70271364
H 1.67507560 2.05019549 0.17738396
*
.sh file (Slurm job):
#!/bin/bash
#SBATCH -p deflt #which partition I want
#SBATCH -o cis1_bh267_m_Cell_152_myjob.out #path for the slurm output
#SBATCH -e cis1_bh267_m_Cell_152_myjob.err #path for the slurm error output
#SBATCH -c 12 #number of cpu(logical cores)/task (task is normally an MPI process, default is one and the option to change it is -n)
#SBATCH -t 2-00:00 #how many time I want the resources (this impacts the job priority as well)
#SBATCH --job-name=cis1_bh267_m_Cell_152 #(to recognize your jobs when checking them with "squeue -u USERID")
#SBATCH -N 1 #number of node, usually 1 when no parallelization over nodes
#SBATCH --nice=0 #lowering your priority if >0
#SBATCH --gpus=0 #number of gpu you want
# This block is echoing some SLURM variables
echo "Jobid = $SLURM_JOBID"
echo "Host = $SLURM_JOB_NODELIST"
echo "Jobname = $SLURM_JOB_NAME"
echo "Subcwd = $SLURM_SUBMIT_DIR"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
# This block is for the execution of the program
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
$(which orca) ${SLURM_JOB_NAME}.inp > ${SLURM_JOB_NAME}.log --use-hwthread-cpus
I added the --use-hwthread-cpus flag following a recommendation, but the same problem appears with and without it.
The full error is:
There are not enough slots available in the system to satisfy the 12 slots that were requested by the application: /data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314/orca_gtoint_mpi
Either request fewer slots for your application, or make more slots available for use. A "slot" is the Open MPI term for an allocatable unit where we can launch a process. The number of slots available are defined by the environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number of hardware threads instead of the number of processor cores, use the --use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the number of available slots when deciding the number of processes to launch.
[file orca_tools/qcmsg.cpp, line 458]:
.... aborting the run
When I look at the output of the calculation, it seems to start running, but when it launches the parallel jobs it fails and gives:
ORCA finished by error termination in GTOInt
Calling Command: mpirun -np 12 --use-hwthread-cpus /data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314/orca_gtoint_mpi cis1_bh267_m_Cell_448.int.tmp cis1_bh267_m_Cell_448
[file orca_tools/qcmsg.cpp, line 458]:
.... aborting the run
We have two kinds of nodes on the cluster.
A bunch of them are:
Xeon 6-core E-2136 @ 3.30GHz (12 logical cores) and Nvidia GTX 1070Ti
And the other ones:
AMD Epyc 24-core (24 logical cores) and 4x Nvidia RTX 2080Ti
Using the command scontrol show node the details of one node of each group are:
First Group:
NodeName=fang1 Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUTot=12 CPULoad=12.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:gtx1070ti:1
NodeAddr=fang1 NodeHostName=fang1 Version=19.05.5
OS=Linux 5.7.12-arch1-1 #1 SMP PREEMPT Fri, 31 Jul 2020 17:38:22 +0000
RealMemory=15923 AllocMem=0 FreeMem=171 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=7961 Weight=1 Owner=N/A MCS_label=N/A
Partitions=deflt,debug,long
BootTime=2020-10-27T09:56:18 SlurmdStartTime=2020-10-27T15:33:51
CfgTRES=cpu=12,mem=15923M,billing=12,gres/gpu=1,gres/gpu:gtx1070ti=1
AllocTRES=cpu=12,gres/gpu=1,gres/gpu:gtx1070ti=1
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Second Group
NodeName=fang50 Arch=x86_64 CoresPerSocket=24
CPUAlloc=48 CPUTot=48 CPULoad=48.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:rtx2080ti:4
NodeAddr=fang50 NodeHostName=fang50 Version=19.05.5
OS=Linux 5.7.12-arch1-1 #1 SMP PREEMPT Fri, 31 Jul 2020 17:38:22 +0000
RealMemory=64245 AllocMem=0 FreeMem=807 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=32122 Weight=1 Owner=N/A MCS_label=N/A
Partitions=deflt,long
BootTime=2020-12-15T10:09:43 SlurmdStartTime=2020-12-15T10:14:17
CfgTRES=cpu=48,mem=64245M,billing=48,gres/gpu=4,gres/gpu:rtx2080ti=4
AllocTRES=cpu=48,gres/gpu=4,gres/gpu:rtx2080ti=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
In the Slurm script I use the -c, --cpus-per-task=integer flag, and in the Orca input the command %pal nprocs integer end. I tested different combinations of these two parameters to check whether I was requesting more CPUs than are available:
-c, --cpus-per-task    %pal nprocs
None                   6
None                   3
None                   2
1                      2
1                      12
2                      6
3                      4
12                     12
I also tried different amounts of memory: 8000 MB and 2000 MB (my total memory is around 15 GB). In all cases the same error appears. I am not an expert user, neither in ORCA nor in computing (but maybe you guessed this from the length of the question), so maybe the solution is simple, but I really don't have it. I don't know what's going on!
A lot of thanks in advance,
Alejandro.
I faced the same issue.
Explicitly passing --prefix ${OMPI_HOME} as an ORCA parameter and using the statically linked ORCA version helped me:
export RSH_COMMAND="/usr/bin/ssh"
export PARAMS="--mca routed direct --oversubscribe -machinefile ${HOSTS_FILE} --prefix ${OMPI_HOME}"
$ORCA_DIR/orca $WORKDIR/$JOBFILE.inp "$PARAMS" > $WORKDIR/$JOBFILE.out
Also, it's better to build OpenMPI 3.1.x with the --disable-builtin-atomics flag.
Thank you @Alexey for your answer. And sorry for the wrong tag; as I said, I am pretty much a rookie at this stuff.
The problem was not in the Orca or OpenMPI configuration but in the bash script used to schedule the Slurm job.
I thought that the entire Orca job was what Slurm calls a "task". For that reason I set --cpus-per-task equal to the number of parallel jobs that I wanted Orca to run. But each parallel Orca job (launched through OpenMPI) is itself a task for Slurm. So my Slurm script reserved a node with at least 12 CPUs, but when Orca launched its parallel jobs, each one asked for 12 CPUs, hence "There are not enough slots available ...": I would have needed 144 CPUs.
The rest of the cases in the table in my question fail for another reason. I was launching 5 different Orca calculations at the same time. Because --cpus-per-task was None, 1, 2 or 3, the five calculations could land on the same node, or on another node with that many free CPUs, but when Orca asked for its parallel jobs it failed again because the node did not have that many CPUs available.
The solution that I found is pretty simple. In the .sh script for Slurm I put this:
#SBATCH --mincpus=n*m
#SBATCH --ntasks=n
#SBATCH --cpus-per-task m
Instead of only:
#SBATCH --cpus-per-task m
where n is the number of parallel jobs specified in the Orca input (%pal nprocs n end) and m is the number of CPUs that you want to use for each parallel Orca job.
In my case I used n = 12, m = 1. With the --mincpus flag I made sure to get a node with at least 12 CPUs and to allocate them. What --cpus-per-task does is pretty evident (even for me :-) ); by the way, it has a default value of 1, and I don't know whether more than 1 CPU per OpenMPI Orca job speeds up the calculation. And --ntasks tells Slurm how many tasks you will run.
Of course, if you know the number of tasks and the CPUs per task it is easy to work out how many CPUs you need to reserve, but I don't know if this is easy for Slurm too :-). So, to be sure that I allocate the correct number of CPUs I used the --mincpus flag, but maybe it is not needed. The thing is that it works now ^_^.
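The n*m in the directives above is a placeholder for a plain number that has to be filled in by hand. As a small sketch, assuming the n = 12, m = 1 values from this answer, the three header lines can be generated like this (the variable names are only illustrative):

```shell
# n = number of Orca parallel jobs (%pal nprocs n end), m = CPUs per job.
n=12
m=1

# Build the three #SBATCH lines with the product already computed.
header=$(printf '#SBATCH --mincpus=%d\n#SBATCH --ntasks=%d\n#SBATCH --cpus-per-task=%d\n' \
  "$(( n * m ))" "$n" "$m")

echo "$header"
```

The printed lines can be pasted at the top of the job script in place of the n and m placeholders.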
It is also important to take into account the amount of memory that you declare in the Orca input (%maxcore) so that you do not exceed the available memory. For example, if you have 12 tasks and 15000 MB of RAM, the amount of memory declared should be no more than 15000/12 = 1250 MB.
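The same division can be scripted so that the memory line always follows the process count. This is only a sketch using the numbers from the example above; in practice some headroom below the computed value is probably wise, which is an assumption on my part rather than something tested here:

```shell
# Node memory in MB and number of Orca processes (values from the example above).
total_mb=15000
nprocs=12

# Upper bound for the per-process memory setting.
maxcore=$(( total_mb / nprocs ))

# Print the two matching Orca input lines.
echo "%pal nprocs $nprocs end"
echo "%maxcore $maxcore"
```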
I had a similar problem with parallel jobs before. Slurm also reported the "not enough slots" error.
My solution was to change parallel threads into parallel processes. On my system that meant changing
#SBATCH -c 24
into
#SBATCH -n 24
and everything works just fine.

parallel computing on multiple cores for data which is run independently with the program

I have a simulation program in Fortran which takes its input from a .dat file. This file has 100,000 lines and takes really long to run. The program takes the first line, runs all the simulations, writes the result to a .out file, and moves on to the next line. I have a computer with 16 CPUs, so how can I split my data into 16 parts and run each part separately on its own CPU? I am running on a machine with Ubuntu. Each line is totally independent of the others.
For example, my data is HeadData10000.dat, and I have a file simulation.ini containing the name of the input data (in this case HeadData10000.dat) and the name of the output data. So the file simulation.ini looks like this:
HeadData10000.dat
outputdata.out
Right now I have two computers, so I split my HeadData10000.dat into two files, make a simulation.ini for each input file, and run it like this on each computer: ./simulation.exe<./simulation.ini.
Assuming your list of 100,000 jobs is called "jobs.txt" and looks like this:
JobA
JobB
JobC
JobD
You could run this:
parallel 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
If you want to do a dry run to see what that would do without doing anything:
parallel --dry-run 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
Sample Output
printf "JobA\nJobA.out" | ./simulation.exe
printf "JobB\nJobB.out" | ./simulation.exe
printf "JobC\nJobC.out" | ./simulation.exe
printf "JobD\nJobD.out" | ./simulation.exe
If you have multiple servers available, look at using the -S parameter to GNU Parallel to spread the jobs across the machines. Also, look at the --eta and --bar parameters for getting progress reports.
I used printf "line1 \n line2" to generate two lines of input in order to avoid having to create, and later delete 100,000 files.
By default, GNU Parallel will keep 1 job per CPU core running, so there will always be 16 jobs running on your 16-core machine, but you can change that to, say, 8 if you want to with parallel -j 8. You can also specify the number of jobs to run on your second (and subsequent) machines.
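The answer assumes jobs.txt already lists one input file per line. One possible way to get there from a single large .dat file is to cut it into 16 chunks on line boundaries and list the chunks. The sketch below assumes GNU coreutils split (for -n l/16 and --additional-suffix); the chunk_ prefix is a made-up name, and a small generated stand-in file replaces the real HeadData10000.dat so the example runs anywhere:

```shell
# Work in a scratch directory with a stand-in input file (160 numbered lines).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 160 > HeadData.dat

# Split into 16 chunks without breaking any line; -d gives numeric suffixes.
split -n l/16 -d --additional-suffix=.dat HeadData.dat chunk_

# One chunk name per line: this is the job list GNU Parallel reads.
ls chunk_*.dat > jobs.txt
chunks=$(wc -l < jobs.txt)
echo "$chunks chunks listed in jobs.txt"
```

With names like chunk_00.dat in jobs.txt, the {.} replacement string in the command above drops the .dat extension, so each job writes to chunk_00.out, chunk_01.out and so on.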

Gnu Parallel: Does parallel reload program for every job?

Suppose I have a program that loads significant content before running...but this is a one time slowdown.
Next, I write:
cat ... | parallel -j 8 --spreadstdin --block $sz ... ./mycode
Will this incur the load overhead for every single job?
If it does induce the overhead, is there a way to avoid it?
As @Barmar says, ./mycode is started for each block in your example.
But since you do not use -k in your example you may be able to use --round-robin.
... | parallel -j 8 --spreadstdin --round-robin --block $sz ... ./mycode
This will start 8 ./mycodes (but not one per block) and give blocks to any process that is ready to read.
This example shows that more blocks are given to process 11 and 10 than process 4 and 5 because 4 and 5 read slower:
seq 1000000 |
parallel -j8 --tag --roundrobin --pipe --block 1k 'pv -qL {}0000 | wc' ::: 11 4 5 6 9 8 7 10
parallel doesn't know anything about the internal workings of the program you're running with it. Each instance runs independently, there's no way that one invocation's initialization can be copied over to the others.
If you want the application to initialize once and then run multiple instances in parallel, you need to design that into the application itself. It should load the data, then use fork() to create multiple processes that use this data.
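The load-once-then-fork design can be illustrated in shell itself, because a backgrounded subshell ( ... ) & is a fork of the running shell and inherits its variables. This is only an analogy under that assumption, with seq standing in for the expensive load; a real application would do the same thing with fork() in its own language:

```shell
# One-time "expensive" load (stand-in: 1000 lines held in a shell variable).
data=$(seq 1 1000)
outdir=$(mktemp -d)

for i in 1 2 3 4; do
  # Each subshell is a forked copy of this shell: $data is already loaded.
  # Worker i counts the lines whose number is congruent to i modulo 4.
  ( echo "$data" | awk -v w="$i" 'NR % 4 == w % 4' | wc -l > "$outdir/$i" ) &
done
wait   # block until all four workers have finished

# Sum the per-worker counts; together they cover every line exactly once.
total=$(cat "$outdir"/* | awk '{ s += $1 } END { print s }')
echo "$total"
```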

optimize parallelisation in SLURM cluster: the case of genome alignment

I would like to understand what is the best way of using bwa in parallel in a SLURM cluster. Obviously, this will depend on the computational limits that I have as user.
The bwa software has a "-t" argument specifying the number of threads. Let's imagine that I use bwa mem -t 3 ref.fa sampleA.fq.gz; this means that bwa splits the job across three tasks/threads. In other words, it will align three reads at a time in parallel (I guess).
Now, if I want to run this command on several samples in a SLURM cluster, should I specify the same number of tasks as for bwa mem, and also specify the number of CPUs per task (for instance 2)? That would be:
sbatch -c 2 -n 3 bwa.sh
where bwa.sh contains:
cat data.info | while read indv; do
bwa mem -t 3 ref.fa sample${indv}.fq.gz
done
Do you have any suggestion? Or can you improve/correct my reasoning?
With -c 2 you are asking to have 2 CPUs per task.
With -n 3 you are asking to have 3 tasks.
That configuration prepares a set of resources comprising 6 CPUs on up to 3 different nodes. But your script only uses 3 CPUs (-t 3), so you are wasting resources and probably using resources that do not belong to you (because the task will use 3 CPUs and you only asked for 2 CPUs per task).
For that specific script, -c 3 is the proper parameter (the other defaults to one task):
sbatch -c 3 bwa.sh
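To keep the thread count and the allocation from drifting apart, the script could read the thread count from the SLURM_CPUS_PER_TASK environment variable, which Slurm sets when -c is given. A sketch, with a fallback of 3 for runs outside Slurm and two hypothetical sample names (A and B) in place of data.info; it prints the bwa commands rather than running them:

```shell
# Use the CPUs Slurm granted per task; fall back to 3 when not under Slurm.
threads="${SLURM_CPUS_PER_TASK:-3}"

# Build one bwa command per sample (A and B are placeholder sample IDs).
cmds=$(for indv in A B; do
  echo "bwa mem -t $threads ref.fa sample${indv}.fq.gz"
done)

echo "$cmds"
```

Submitted with sbatch -c 3 this gives -t 3; changing only the -c value then changes the bwa thread count with it.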

Is there a way to flush stdout on process termination for parallel processes

I'm running several independent programs on a single machine in parallel.
The processes (say 100) are all relatively short (<5 minutes) and their output is limited to a few hundred lines (~kilobytes).
Usually the output in a terminal then becomes mangled because the processes write directly to the same buffer. I would like these outputs to be un-mangled so that it's easier to debug certain processes. I could write these outputs to temporary files but I would like to limit disk IO and would prefer another method if possible. It would require cleaning up and probably won't really improve code readability.
Is there any shell native method that allows buffers to be PID separated which then flushes to stdout/stderr when the process terminates ? Do you see any other way to do this ?
Update
I ended up using the tail -n 1000000 trick from @Gem's comment. Since the commands I'm using are long (covering multiple lines) and I was already using subshells ( ... ) &, it was a quite minimal change from ( ... ) & to ( ... ) 2>&1 | tail -n 1000000 &.
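The trick works because tail has to read its whole input before it can know which lines are the last ones, so it holds everything until the subshell exits and then writes it out in one go. A small self-contained sketch with three toy jobs (the echo and sleep commands are stand-ins for the real programs):

```shell
# Run three toy jobs in the background; each prints two lines with a pause between.
out=$(
  for i in 1 2 3; do
    ( echo "job $i line 1"; sleep 0.1; echo "job $i line 2" ) 2>&1 | tail -n 1000000 &
  done
  wait
)

# Lines from one job stay adjacent; the order of the jobs themselves may vary.
printf '%s\n' "$out"
```

Without the tail stage, the second lines would typically all arrive after the first lines of every job, which is the interleaving described in the question.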
You can do that with GNU Parallel. Use -k to keep the output in order and ::: to separate the arguments you want passed to your program.
Here we run 5 instances of echo in parallel:
parallel -k echo {} ::: {0..4}
0
1
2
3
4
Now add in --tag to tag your output lines with the filenames or parameters you are using:
parallel --tag -k 'echo "Line 1, param {}"; echo "Line 2, param {}"' ::: {1..4}
1 Line 1, param 1
1 Line 2, param 1
2 Line 1, param 2
2 Line 2, param 2
3 Line 1, param 3
3 Line 2, param 3
4 Line 1, param 4
4 Line 2, param 4
You should notice that each line is tagged on the left side with the parameters and that the two lines from each job are kept together.
You can now specify how your output is organised.
Use --group to group output by job
Use --line-buffer to buffer a line at a time
Use --ungroup if you want output all mixed up, but as soon as available
Sounds like you just want syslog, or rather logger, its shell interface. Example:
echo "Something happened!" | logger -i -p local0.notice
If you insist on getting output to stderr too, use --stderr. rsyslog will handle buffering, atomic writes, etc., and is presumably pretty good at optimizing disk I/O. You could also easily configure rsyslog to route the log facility (i.e. local0 or whatever you choose to use) wherever you want, such as to a tmpfs or dedicated disk, or even over TCP. See /etc/rsyslog.conf.

Resources