Parallel processing with dependencies on a SGE cluster - parallel-processing

I'm doing some experiments on a computing cluster. My algorithm has two steps. The first one writes its outputs to some files which will be used by the second step. The dependecies are 1 to n meaning one step2 programs needs the output of n step1 program. I'm not sure what to do neither waist cluster resources nor keep the head node busy. My current solution is:
submit script (this runs on the head node)
for different params, p:
run step 1 with p
sleep some time based on the an estimate of how much step 1 takes
for different params, q:
run step 2 with q
step 2 algorithm (this runs on the computing nodes)
while files are not ready:
sleep a few minutes
do the step 2
Is there any better way to do this?

SGE provides both job dependencies and array jobs for that. You can submit your phase 1 computations an array job and then submit the phase 2 computation as a dependent job using the qsub -hold_jid <phase 1 job ID|name> .... This will make the phase 2 job wait until all the phase 1 computations have finished and then it will be released and dispatched. The phase 1 computations will run in parallel as long as there are enough slots in the cluster.
In a submission script it might be useful to specifiy holds by job name and name each array job in a unique way. E.g.
mkdir experiment_1; cd experiment_1
qsub -N phase1_001 -t 1-100 ./phase1
qsub -hold_jid phase1_001 -N phase2_001 ./phase2 q1
cd ..
mkdir experiment_2; cd experiment_2
qsub -N phase1_002 -t 1-42 ./phase1 parameter_file
qsub -hold_jid phase1_002 -N phase2_002 ./phase2 q2
cd ..
This will schedule 100 executions of the phase1 script as the array job phase1_001 and another 42 executions as the array job phase1_002. If there are 142 slots on the cluster, all 142 executions will run in parallel. Then one execution of the phase2 script will be dispatched after all tasks in the phase1_001 job have finished and one execution will be dispatched after all tasks in the phase1_002 job have finished. Again those can run in parallel.
Each taks in the array job will receive a unique $SGE_TASK_ID value ranging from 1 to 100 for the tasks in job phase1_001 and from 1 to 42 for the tasks in job phase1_002. From it you can compute the p parameter.

Related

Problems with Orca and OpenMPI for parallel jobs

Hello to the community:
I recently started to use ORCA software for some quantum calculation but I have been having a lot of problems to lunch a parallel calculation in the cluster of my University.
To install Orca I used the static version:
orca_4_2_1_linux_x86-64_openmpi314.tar.xz.
In a shared direction of the cluster (/data/shared/opt/ORCA/).
And putted in my ~/.bash_profile:
export PATH="/data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314:$PATH"
export LD_LIBRARY_PATH="/data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314:$LD_LIBRARY_PATH"
For the installation of the corresponding OpenMPI version (3.1.4)
tar -xvf openmpi-3.1.4.tar.gz
cd openmpi-3.1.4
./configure --prefix="/data/shared/opt/ORCA/openmpi314/"
make -j 10
make install
When I use the frontend server all is wonderful:
With a .sh like this:
#! /bin/bash
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
$(which orca) test.inp > test.out
and an input like this:
# Computation of myjob at b3lyp/6-31+G(d,p)
%pal nprocs 10 end
%maxcore 8192
! RKS B3LYP 6-31+G(d,p)
! TightSCF Grid5 NoFinalGrid
! Opt
! Freq
%cpcm
smd true
SMDsolvent "water"
end
* xyz 0 1
C 0 0 0
O 0 0 1.5
*
The problem appears when I use the nodes:
.inp file:
#! Computation at RKS B3LYP/6-31+G(d,p) for cis1_bh267_m_Cell_152
%pal nprocs 12 end
%maxcore 8192
! RKS B3LYP 6-31+G(d,p)
! TightSCF Grid5 NoFinalGrid
! Opt
! Freq
%cpcm
smd true
SMDsolvent "water"
end
* xyz 0 1
C -4.38728130 0.21799058 0.17853303
C -3.02072869 0.82609890 -0.29733316
F -2.96869122 2.10937041 0.07179384
F -3.01136328 0.87651596 -1.63230798
C -1.82118365 0.05327804 0.23420220
O -2.26240947 -0.92805650 1.01540713
C -0.53557484 0.33394113 -0.05236121
C 0.54692198 -0.46942807 0.50027196
O 0.31128292 -1.43114232 1.22440290
C 1.93990391 -0.12927675 0.16510948
C 2.87355011 -1.15536140 -0.00858832
C 4.18738231 -0.82592189 -0.32880964
C 4.53045856 0.52514329 -0.45102225
N 3.63662927 1.52101319 -0.26705841
C 2.36381718 1.20228695 0.03146190
F -4.51788749 0.24084604 1.49796862
F -4.53935644 -1.04617745 -0.19111502
F -5.43718443 0.87033190 -0.30564680
H -1.46980819 -1.48461498 1.39034280
H -0.26291843 1.15748249 -0.71875720
H 2.57132559 -2.20300864 0.10283592
H 4.93858460 -1.60267627 -0.48060140
H 5.55483009 0.83859415 -0.70271364
H 1.67507560 2.05019549 0.17738396
*
.sh file (Slurm job):
#!/bin/bash
#SBATCH -p deflt #which partition I want
#SBATCH -o cis1_bh267_m_Cell_152_myjob.out #path for the slurm output
#SBATCH -e cis1_bh267_m_Cell_152_myjob.err #path for the slurm error output
#SBATCH -c 12 #number of cpu(logical cores)/task (task is normally an MPI process, default is one and the option to change it is -n)
#SBATCH -t 2-00:00 #how many time I want the resources (this impacts the job priority as well)
#SBATCH --job-name=cis1_bh267_m_Cell_152 #(to recognize your jobs when checking them with "squeue -u USERID")
#SBATCH -N 1 #number of node, usually 1 when no parallelization over nodes
#SBATCH --nice=0 #lowering your priority if >0
#SBATCH --gpus=0 #number of gpu you want
# This block is echoing some SLURM variables
echo "Jobid = $SLURM_JOBID"
echo "Host = $SLURM_JOB_NODELIST"
echo "Jobname = $SLURM_JOB_NAME"
echo "Subcwd = $SLURM_SUBMIT_DIR"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
# This block is for the execution of the program
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
$(which orca) ${SLURM_JOB_NAME}.inp > ${SLURM_JOB_NAME}.log --use-hwthread-cpus
I used the --use-hwthread-cpus flag as a recommendation but the same problem appears with and without this flag.
All the error is:
There are not enough slots available in the system to satisfy the 12 slots that were requested by the application: /data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314/orca_gtoint_mpi
Either request fewer slots for your application, or make more slots available for use. A "slot" is the Open MPI term for an allocatable unit where we can launch a process. The number of slots available are defined by the environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an RM is present, Open MPI defaults to the number of processor cores In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the --use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the number of available slots when deciding the number of processes to launch.
*[file orca_tools/qcmsg.cpp, line 458]:
.... aborting the run*
When I go to the output of the calculation, it looks like start to run but when launch the parallel jobs fail and give:
ORCA finished by error termination in GTOInt
Calling Command: mpirun -np 12 --use-hwthread-cpus /data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314/orca_gtoint_mpi cis1_bh267_m_Cell_448.int.tmp cis1_bh267_m_Cell_448
[file orca_tools/qcmsg.cpp, line 458]:
.... aborting the run
We have two kind of nodes on the cluster:
A punch of them are:
Xeon 6-core E-2136 # 3.30GHz (12 logical cores) and Nvidia GTX 1070Ti
And the other ones:
AMD Epyc 24-core (24 logical cores) and 4x Nvidia RTX 2080Ti
Using the command scontrol show node the details of one node of each group are:
First Group:
NodeName=fang1 Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUTot=12 CPULoad=12.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:gtx1070ti:1
NodeAddr=fang1 NodeHostName=fang1 Version=19.05.5
OS=Linux 5.7.12-arch1-1 #1 SMP PREEMPT Fri, 31 Jul 2020 17:38:22 +0000
RealMemory=15923 AllocMem=0 FreeMem=171 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=7961 Weight=1 Owner=N/A MCS_label=N/A
Partitions=deflt,debug,long
BootTime=2020-10-27T09:56:18 SlurmdStartTime=2020-10-27T15:33:51
CfgTRES=cpu=12,mem=15923M,billing=12,gres/gpu=1,gres/gpu:gtx1070ti=1
AllocTRES=cpu=12,gres/gpu=1,gres/gpu:gtx1070ti=1
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Second Group
NodeName=fang50 Arch=x86_64 CoresPerSocket=24
CPUAlloc=48 CPUTot=48 CPULoad=48.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:rtx2080ti:4
NodeAddr=fang50 NodeHostName=fang50 Version=19.05.5
OS=Linux 5.7.12-arch1-1 #1 SMP PREEMPT Fri, 31 Jul 2020 17:38:22 +0000
RealMemory=64245 AllocMem=0 FreeMem=807 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=32122 Weight=1 Owner=N/A MCS_label=N/A
Partitions=deflt,long
BootTime=2020-12-15T10:09:43 SlurmdStartTime=2020-12-15T10:14:17
CfgTRES=cpu=48,mem=64245M,billing=48,gres/gpu=4,gres/gpu:rtx2080ti=4
AllocTRES=cpu=48,gres/gpu=4,gres/gpu:rtx2080ti=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
I use in the script of Slurm the flag -c, --cpus-per-task = integer; and in the input for Orca the command %pal nprocs integer end. I tested different combinations of this two parameters in order to see if I am using more CPU than the available:
-c, --cpus-per-task = integer
%pal nprocs integer end
None
6
None
3
None
2
1
2
1
12
2
6
3
4
12
12
With different amount of memories: 8000 MBi and 2000 MBi (my total memory is around 15 GBi). And in all the cases the same error appears. I am not an expert user neither in ORCA non in informatic (but maybe you guess this for the extension of the question), so maybe the solution is simple but I really don’t have it, Idon't know what's going on!
A lot of thanks in advance,
Alejandro.
Faced the same issue.
Explicit declaration --prefix ${OMPI_HOME} directly as ORCA parameter and using of static linked ORCA version helps me:
export RSH_COMMAND="/usr/bin/ssh"
export PARAMS="--mca routed direct --oversubscribe -machinefile ${HOSTS_FILE} --prefix ${OMPI_HOME}"
$ORCA_DIR/orca $WORKDIR/$JOBFILE.inp "$PARAMS" > $WORKDIR/$JOBFILE.out
Also, It's better to build OpenMPI 3.1.x with --disable-builtin-atomics flag.
Thank you #Alexey for your answer. And sorry for the wrong Tag, like I said, I am pretty rookie on this stuff.
The problem was not in the Orca or OpenMPI configuration but in the bash script used for scheduled the Slurm job.
I thought that the entire Orca job itself was what Slurm call a "task". For that reason I declared the flag --cpus-per-task equal to the number of parallel jobs that I want to do with Orca. But the problem is that each parallel Orca job (that is launch using OpenMPI) is a task for Slurm. Therefore with my Slurm script I was reserving a node with at least 12 CPU, but when Orca launch their parallel jobs, each one ask for 12 CPU, so: "There are not enough slots available ..." because I needed 144 CPU.
The rest of the cases in the table of my Question fails for another reason. I was launching at the same time 5 different Orca calculation. Now, because --cpus-per-task could be None, 1, 2 or 3; the five calculation might enter in the same node or in another node with this amount of free CPU, but when Orca ask for the parallel jobs, fail again because there are not this amount of CPU on the node.
The solution that I found is pretty simple. On the .sh script for Slurm I putted this:
#SBATCH --mincpus=n*m
#SBATCH --ntasks=n
#SBATCH --cpus-per-task m
Instead of only:
#SBATCH --cpus-per-task m
Where n will be equal to the number of parallel jobs specified on the Orca input (%pal nprocs n end) and m the number of CPU that you want to use for each parallel Orca job.
In my case I used n = 12, m = 1. With the flag --mincpus I ensured to take a node with at least 12 CPU and allocated them. With the --cpus-per-task is pretty evident what this flag do (even for me :-) ), which, by the way, has a default value of 1 and I don't know if more than 1 CPU for each OpenMPI Orca job improve the velocity of the calculation. And --ntasks gives the information to Slurm of how many task you will do.
Of course if you know the number of task and the CPU per task is easy to know how many CPU you need to reserve, but I don't know if this is easy to Slurm too :-). So, to be sure that I allocate the correct number of CPU i used --mincpus flag, but maybe is not needed. The thing is that it works now ^_^.
It is also important to take into account the amount of memory that you declare in the input of Orca in order of do not exceed the available memory. For example, if you have 12 task and a RAM of 15000 MBi, the right amount of memory to declared should be no more than 15000/12 = 1250 MBi
I had a similar problem with parallel jobs before. The slurm also output not enough slots error.
My solution is to change parallel threads into parallel processes. For my system is to change
#SBATCH -c 24
into
#SBATCH -n 24
and everything works just fine.

Slurm --cpus-per-task command

Hello everyone I'm actually using a soft called RepeatMasker, in this pipeline I can run parallelized job via slurm with the command -pa
here is a doc about this command :
RepeatMasker -h
-pa(rallel) [number]
The number of sequence batch jobs [50kb minimum] to run in parallel.
RepeatMasker will fork off this number of parallel jobs, each
running the search engine specified. For each search engine
invocation ( where applicable ) a fixed the number of cores/threads
is used:
RMBlast 4 cores
To estimate the number of cores a RepeatMasker run will use simply
multiply the -pa value by the number of cores the particular search
engine will use.
so in a slurm batch script I should add :
#SBATCH --cpus-per-task=8
RepeatMakser -pa 2, right?
since 8/4 =2
But I wondered if I should also add others #SBATCH parameters or if --cpus-per-task is sufficient ?
Thanks al ot

optimize parallelisation in SLURM cluster: the case of genome alignemnt

I would like to understand what is the best way of using bwa in parallel in a SLURM cluster. Obviously, this will depend on the computational limits that I have as user.
bwa software has an argument "-t" specifying the number of threads. Let's imagine that I use bwa mem -t 3 ref.fa sampleA.fq.gz, this will mean that bwa split the job on three tasks/threads. In other words, it will align three reads at a time in parallel (I guess).
Now, if I want to run this command on several samples and in a SLURM cluster, Shall I specify the number of tasks as for bwa mem, and specify the number of CPUs per task(for instance 2)? Which would be:
sbatch -c 2 -n 3 bwa.sh
where bwa.sh containes:
cat data.info | while read indv; do
bwa mem -t 3 ref.fa sample${indv}.fq.gz
done
Do you have any suggestion? Or can you improve/correct my reasoning?
With -c 2 you are asking to have 2 CPUs per task.
With -n 3 you are asking to have 3 tasks.
That configuration prepares a set of resources that comprises 6 CPUs in up to 3 different nodes. But your script only used 3 CPUs (-t 3), so you are wasting resources and probably using resources that does not belong to you (because the task will use 3 CPUs and you only asked for 2 CPUs per task).
For that specific script, -c 3 is the proper parameter (the other defaults to one task):
sbatch -c 3 bwa.sh

Invalid job array specification in slurm

I am submitting a toy array job in slurm. My command line is
$ sbatch -p development -t 0:30:0 -n 1 -a 1-2 j1
where j1 is script:
#!/bin/bash
echo job id is $SLURM_JOB_ID
echo array job id is $SLURM_ARRAY_JOB_ID
echo task id id $SLURM_ARRAY_TASK_ID
When I submit this, I get an error:
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/03400/myname)...OK
--> Verifying availability of your work dir (/work/03400/myname)...OK
--> Verifying availability of your scratch dir (/scratch/03400/myname)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (development)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (PRJ-1234)...OK
sbatch: error: Batch job submission failed: Invalid job array specification
The same job works fine without the array specification:
$ sbatch -p development -t 0:30:0 -n 1 j1
This post is a bit old, but in case it happens for other people, I have had the same issue but the accepted answer did not suggest what was the problem in my case.
This error (sbatch: error: Batch job submission failed: Invalid job array specification) can also be raised when the array size is too large.
From https://slurm.schedmd.com/slurm.conf.html
MaxArraySize
The maximum job array size. The maximum job array task index value will be one less than MaxArraySize to allow for an index value of zero. Configure MaxArraySize to 0 in order to disable job array use. The value may not exceed 4000001. The value of MaxJobCount should be much larger than MaxArraySize. The default value is 1001.
To check the value, the slurm.conf file should be accessible by all slurm users (still according to 1) and may be found somewhere near /etc/slurm.conf (see https://slurm.schedmd.com/slurm.conf.html#lbAM, in my case I found it at path /etc/slurm/slurm.conf).
The syntax for your array specification is correct. But the printout you paste is not standard Slurm, I guess you are working on Stampede ; they have their own sbatch wrapper.
What you could do is use the -vvv option to sbatch to see exactly what Slurm sees:
$ sbatch -vvv -p development -t 0:30:0 -n 1 -a 1-2 j1 |& grep array
This should return
sbatch: array : 1-2
and if it does not it means the information is somehow lost somewhere.
What you can try is remove the array specification from the submission command line and insert it in the submission script, like this:
$ sbatch -p development -t 0:30:0 -n 1 j1
with j1 being
#!/bin/bash
#SBATCH -a 1-2
echo job id is $SLURM_JOB_ID
echo array job id is $SLURM_ARRAY_JOB_ID
echo task id id $SLURM_ARRAY_TASK_ID
The next step is to contact the system administrators with the information you will get from running the above tests and ask for help.

Hive takes long time to launch hadoop job

I am a newbie to Hadoop and Hive. I am using Hive integration with Hadoop to execute the queries. When I submit any query, following log messages appear on console:
Hive history
file=/tmp/root/hive_job_log_root_28058#hadoop2_201203062232_1076893031.txt Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce
tasks determined at compile time: 1 In order to change the average
load for a reducer (in bytes): set
hive.exec.reducers.bytes.per.reducer= In order to limit the
maximum number of reducers: set hive.exec.reducers.max= In
order to set a constant number of reducers: set
mapred.reduce.tasks= Starting Job = job_201203062223_0004,
Tracking URL =
http://:50030/jobdetails.jsp?jobid=job_201203062223_0004 Kill
Command = //opt/hadoop_installation/hadoop-0.20.2/bin/../bin/hadoop
job -kill job_201203062223_0004 Hadoop job information for Stage-1:
number of mappers: 1; number of reducers: 1 2012-03-06 22:32:26,707
Stage-1 map = 0%, reduce = 0% 2012-03-06 22:32:29,716 Stage-1 map =
100%, reduce = 0% 2012-03-06 22:32:38,748 Stage-1 map = 100%, reduce
= 100% Ended Job = job_201203062223_0004 MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 HDFS Read: 8107686 HDFS Write: 4 SUCCESS Total
MapReduce CPU Time Spent: 0 msec OK
The text mentioned in bold starts a hadoop job (that's what I believe). It takes long time to start the job. Once this line gets executed, the map reduce operations execute swiftly. Following are my questions:
Is there any way to make the launch of hadoop job faster. Is it possible to skip this phase?
Where does the value of 'Kill command' come from (in the bold text)?
Please let me know if any inputs are required.
1) Starting Job = job_201203062223_0004, Tracking URL = http: :50030/jobdetails.jsp?jobid=job_201203062223_0004
ANS: your HQL query > translated to hadoop job > hadoop will do some background work (like planning resources,data locality,stages needed to process query,launch configs,job,taskids generation etc) > launch mappers > sort && shuffle > reduce (aggregation) > result to hdfs .
The above flow is part of hadoop job life cycle, so no skipping of any..
http://namenode:port/jobtracker.jsp --- you can see ur job status with job-id :job_201203062223_0004, (Monitering)
2) Kill Command = HADOOP_HOME/bin/hadoop job -kill job_201203062223_0004
Ans : before launching your mappers, you will be showed with these lines because, hadoop works on bigdata, which may take much or less time depends on your dataset size. so at any point of time if you want to kill the job, its a help line . For any hadoop-job this line will be shown, it won't take much time to show an info line like this.
some addons with respect to your comments :
Hive is not meant for low Latency jobs , i mean immediate in time results not possible.
(plz check the hive -purposes in apache.hive)
launching overhead(refer q1s - hadoop will do some background work) is there in Hive, it cant be avoided.
Even for datasets of small size, these launching over head is there in hadoop.
PS : if you are really expecting in time quick results ( plz refer shark )
first,Hive is the tool which replace your mr work by HQL.In the background,it has lost of predefined funcitions,mr programes.Run a HQL,HADOOP Cluster will do lost of things,find the data blocks,allocating task,and so on.
Second,you can kill a job by the hadoop shell command.
If you job id is AAAAA.
you can execute below command to kill it
$HADOOP_HOME/bin/hadoop job -kill AAAAA
Launch of hadoop job can get delayed due to unavailability of resources. If you use yarn you can see that the jobs are in accepted state but not yet running. This means there is some other ongoing job that has consumed all your executors and the new query is waiting to run.
You can kill the older job by using hadoop job -kill <job_id> command or wait for it to finish.

Resources