Holding job script after completion of one simulation - bash

I run multiple serial jobs on an HPC cluster. For example, if I have 10 simulations, I request 10 cores and use one core per simulation. However, the simulations finish at different times, and as soon as one of them completes, all the others stop as well. How do I hold the job script so that the remaining simulations keep running even after one has finished, i.e. so that the job stays on the HPC cluster? An example of my job script:
#!/bin/bash
#SBATCH --job-name=CaseName # name of the job
#SBATCH --ntasks=60 # number of requested cores
#SBATCH --cpus-per-task=1
#SBATCH --time=7-00:00:00 # time limit
#SBATCH --partition=core64 # queue
cd Folder1
for i in {1..5}
do
cd Folder$i
for j in {1..6}
do
cd SubFolder$j
application > log 2>&1 &
cd ..
done
cd ..
done
cd ..
cd LastFolder
application > log 2>&1
Is there any command I can add to the job script to do so, i.e. to keep the remaining simulations running on the HPC cluster after one of them ends?

You need a wait at the end of your script: you run the jobs in the background, and you want the script to exit only when all of them have finished.
From man bash:
wait [-fn] [-p varname] [id ...]
Wait for each specified child process and return
its termination status. ...
...
If id is not given, wait waits for all running background jobs...
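In your script, that means adding a single line after the final (foreground) application call; a minimal sketch of the end of your script with the suggested wait added:
cd ..
cd LastFolder
application > log 2>&1   # this run stays in the foreground
wait                     # block until every backgrounded 'application' above has finished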

There's something wrong with your cd logic.
Perhaps try running the cd and the application in a subshell, e.g.
(cd SubFolder$j ; application > log 2>&1 & )
That way you can be sure that every command runs concurrently in its own subdirectory, without the commands impacting each other.
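Putting both answers together, the body of the job script could look roughly like this. This is only a sketch: it assumes the intended layout is Folder$i/SubFolder$j under the submission directory, plus a top-level LastFolder, as suggested by the question.
for i in {1..5}
do
  for j in {1..6}
  do
    # subshell: the cd only affects this command, and the run is backgrounded
    ( cd "Folder$i/SubFolder$j" && application > log 2>&1 ) &
  done
done
( cd LastFolder && application > log 2>&1 ) &
wait    # keep the job alive until every simulation has finished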

Run shell script in parallel with more jobs than CPUs, and after a job is finished, instantly take the available spot

I'm running a shell script script.sh in parallel; each of its lines goes into a folder and runs a Fortran code:
cd folder1 && ./code &
cd folder2 && ./code &
cd folder3 && ./code &
cd folder4 && ./code &
..
cd folder96 && ./code
wait
cd folder97 && ./code
..
..
..
cd folder2500 && ./code
There are around 2500 folders and the code outputs are independent of each other. I have access to 96 CPUs and each job uses around 1% of CPU, so I run 96 jobs in parallel using & and the wait command. For various reasons, not all 96 jobs finish at the same time: some take 40 minutes, some 90 minutes, an important difference. So I was wondering whether the jobs that finish earlier could hand their CPUs over to new jobs, in order to optimize the total execution time.
I also tried GNU Parallel:
parallel -a script.sh
but it had the same issue, and I could not find anyone online with a similar problem.
You can use GNU Parallel:
parallel 'cd {} && ./code' ::: folder*
That will keep all your cores busy, starting a new job immediately as each job finishes.
If you only want to run 48 jobs in parallel, use:
parallel -j 48 ...
If you want to do a dry run and see what would run but without actually running anything, use:
parallel --dry-run ...
If you want to see a progress report, use:
parallel --progress ...
One bash/wait -n approach:
jobmax=96
jobcnt=0
for ((i=1;i<=2500;i++))
do
((++jobcnt))
[[ "${jobcnt}" -gt "${jobmax}" ]] && wait -n && ((--jobcnt)) # if jobcnt > 96 => wait for a job to finish, decrement jobcnt, then continue with next line ...
( cd "folder$i" && ./code ) & # kick off new job
done
wait # wait for rest of jobs to complete
NOTES:
when jobs complete quickly (e.g., in under a second), more than one job could finish during a single wait -n; start new job; wait -n cycle, in which case you could end up with fewer than jobmax jobs running at a time (i.e., jobcnt is higher than the actual number of running jobs)
however, in this scenario, where each job is expected to take 40-90 minutes to complete, the likelihood of multiple jobs completing during that cycle is greatly diminished, if not eliminated
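If you want the count to track the actual number of running jobs exactly, one variation (a sketch, not part of the original answer) is to recount the running background jobs on each iteration instead of keeping a separate counter:
jobmax=96
for ((i=1;i<=2500;i++))
do
  # block while the number of running background jobs is at the limit
  while (( $(jobs -rp | wc -l) >= jobmax ))
  do
    wait -n   # returns as soon as any one job finishes
  done
  ( cd "folder$i" && ./code ) &   # kick off new job
done
wait   # wait for the remaining jobs to complete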

sbatch+srun: Large number of single-thread jobs

Hello friendly people,
my question is rather specific.
For more than a week, I have been trying to submit thousands of single-thread jobs for a scientific experiment using sbatch and srun.
The problem is that these jobs may take different amounts of time to finish, and some may even be aborted as they exceed the memory limit. Both behaviors are fine and my evaluation handles them.
But I am facing the problem that some of the jobs never start, even though they have been submitted.
My sbatch script looks like this:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
wait 5s
done
Now, my error log shows the following message:
srun: Job 1846955 step creation temporarily disabled, retrying
1) What does 'step creation temporarily disabled' mean? Are all CPUs busy and the step skipped, or is it started again later when resources are free?
2) Why are some of my jobs not carried out and how can I fix it? Do I use the correct parameters for srun?
Thanks for your help!
srun: Job 1846955 step creation temporarily disabled, retrying
This is normal: you reserve 4 x 12 = 48 CPUs and start 500 instances of srun. Only 48 instances will run at a time, while the others will output that message. Whenever a running instance stops, a pending instance starts.
wait 5s
The wait command is used to wait for processes, not for a certain amount of time. For that, use the sleep command. The wait command must be at the end of the script. Otherwise, the job could stop before all srun instances have finished.
So the script should look like this:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
done
wait
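If the intent of the original wait 5s was to pause briefly between submissions, a sleep inside the loop would do that. This variant is just a sketch and is not required for correctness:
for i in {1..500}
do
  srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
  sleep 5   # optional: pause 5 seconds between submissions
done
wait        # still required: wait for all srun instances to finish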

How to run different independent parallel job on different nodes using slurm using worker/master concept?

I have a program that uses the master/slave concept for parallelization. There is a master directory and multiple worker directories. I should first run the executable in the master directory, then go to the worker directories and run the worker executable in each of them. The master waits for the workers to finish their jobs and send the results back to the master for further calculations. The jobs of the worker directories are independent of each other, so they can run on different machines (nodes). The master and workers communicate with each other using the TCP/IP protocol.
I'm working on a cluster with 16 nodes, each with 28 cores, managed by the Slurm job manager. I can run my jobs with 20 workers on 1 node totally fine. Currently my Slurm script looks like this:
#!/bin/bash
#SBATCH -n 1 # total number of tasks requested
#SBATCH --cpus-per-task=18 # cpus to allocate per task
#SBATCH -p shortq # queue (partition) -- defq, eduq, gpuq.
#SBATCH -t 12:00:00 # run time (hh:mm:ss) - 12.0 hours in this.
cd /To-master-directory
master.exe /h :4004 &
MASTER_PID=$!
cd /To-Parent
# This is the directory that contains all worker (wrk)directories
parallel -i bash -c "cd {} ; worker.exe /h 127.0.0.1:4004" --
wrk1 wrk2 wrk3 wrk4 wrk5 wrk6 wrk7 wrk8 wrk9 wrk10 wrk11 wrk12 wrk13 wrk14
wrk15 wrk16 wrk17 wrk18 wrk19 wrk20
kill ${MASTER_PID}
I was wondering how I can modify this script to divide the jobs running on the workers between multiple nodes. For example, jobs associated with wrk1 to wrk5 run on node 1, jobs associated with wrk6 to wrk10 run on node 2, etc.?
First, you need to let Slurm allocate distinct nodes for your job, so you need to remove the --cpus-per-task option and rather ask for 18 tasks.
Second, you need to get the hostname where the master runs as 127.0.0.1 will no longer be valid in a multi-node setup.
Third, just add srun before the call to bash in parallel. With --exclusive -n 1 -c 1, it will dispatch each instance of the worker spawned by parallel to one of the CPUs in the allocation. They might be on the same node or on other nodes.
So the following could work (untested):
#!/bin/bash
#SBATCH -n 18 # total number of tasks requested
#SBATCH -p shortq # queue (partition) -- defq, eduq, gpuq.
#SBATCH -t 12:00:00 # run time (hh:mm:ss) - 12.0 hours in this.
cd /To-master-directory
master.exe /h :4004 &
MASTER_PID=$!
MASTER_HOSTNAME=$(hostname)
cd /To-Parent
# This is the directory that contains all worker (wrk)directories
parallel -i srun --exclusive -n 1 -c 1 bash -c "cd {} ; worker.exe /h $MASTER_HOSTNAME:4004" --
wrk1 wrk2 wrk3 wrk4 wrk5 wrk6 wrk7 wrk8 wrk9 wrk10 wrk11 wrk12 wrk13 wrk14
wrk15 wrk16 wrk17 wrk18 wrk19 wrk20
kill ${MASTER_PID}
Note that in your example with 18 tasks and 20 directories to process, the job will first run 18 workers and then the two additional ones will be 'micro-scheduled' whenever a previous task finishes.
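If the parallel installed on the cluster is GNU parallel rather than the variant that accepts the -i ... -- form used above, an equivalent invocation (an untested sketch) would be:
parallel -j 18 "srun --exclusive -n 1 -c 1 bash -c 'cd {} && worker.exe /h $MASTER_HOSTNAME:4004'" ::: wrk{1..20}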

Running a queue of MPI calls in parallel with SLURM and limited resources

I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core matlab process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call takes up to 20 minutes.
I initially naively submitted each MPI call as a separate SLURM job, but the resulting queue time made it slower than running each job locally in serial. I am now trying to figure out a way to submit an N node job that will continuously run MPI tasks to utilize the available resources. The matlab process would manage this job with text file flags.
Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt" == 1>
do
for i in {0..40}
do
if <"RunJob_i.txt" == 1>
then
mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
fi
done
done
wait
This approach doesn't work (just crashes), but I don't know why (probably overutilization of resources?). Some of my peers have suggested using parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. This will be a huge waste of resources, as a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem would be starting a batch of 5 8-core jobs and having 4 of them crash immediately; now 32 cores would be doing nothing while they wait up to 20 minutes for the 5th job to finish.
Since the optimization will likely require upwards of 5000 mpi calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice as to how I could run a constant stream of MPI calls on a large SLURM job? I would really appreciate any help.
A couple of things: under SLURM you should be using srun, not mpirun.
The second thing is that the pseudo-code you provided launches an infinite number of jobs without waiting for any completion signal. You should move the wait inside the outer loop, after the inner for loop, so that you launch one set of jobs, wait for them to finish, evaluate the condition and, maybe, launch the next set of jobs:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 4 # total number of tasks in this job
#SBATCH -c 8 # number of processor cores (CPUs) for each task
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt" == 1>
do
for i in {0..40}
do
if <"RunJob_i.txt" == 1>
then
srun -n 8 --exclusive <job_i> &
fi
done
wait
<Update "KeepRunning.txt">
done
Take care also to distinguish between tasks and cores: -n says how many tasks will be used, -c says how many CPUs per task will be allocated.
The code above will launch 41 jobs in the background (from 0 to 40, inclusive), but they will only start once the resources are available (--exclusive), waiting while the resources are occupied. Each job will use 8 CPUs. Then you wait for them to finish, and I assume you will update KeepRunning.txt after that round.
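For completeness, one way to turn the placeholder tests into real bash. This is only a sketch: the flag files are assumed to contain a literal 1, and ./job_${i} stands in for whatever <job_i> actually is:
while [[ "$(cat KeepRunning.txt)" == "1" ]]
do
  for i in {0..40}
  do
    if [[ "$(cat RunJob_${i}.txt)" == "1" ]]
    then
      srun -n 8 --exclusive ./job_${i} &   # ./job_${i} is a placeholder for the real MPI program
    fi
  done
  wait
  # KeepRunning.txt is updated here (e.g. by the MATLAB controller) before the next round
done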

What does the --ntasks or -n option do in SLURM?

I was using SLURM on a computing cluster and came across the --ntasks or -n option. I have obviously read the documentation for it (http://slurm.schedmd.com/sbatch.html):
sbatch does not launch tasks, it requests an allocation of resources
and submits a batch script. This option advises the Slurm controller
that job steps run within the allocation will launch a maximum of
number tasks and to provide for sufficient resources. The default is
one task per node, but note that the --cpus-per-task option will
change this default.
The specific part I do not understand is:
run within the allocation will launch a maximum of number tasks and to
provide for sufficient resources.
I have a few questions:
I guess my first question is what the word "task" means and how it differs from the word "job" in the SLURM context. I usually think of a job as running the bash script under sbatch, as in sbatch my_batch_job.sh. I am not sure what a task means.
If I equate the word task with job, then I thought it would run the same identical bash script multiple times according to the argument to -n, --ntasks=<number>. However, I tested it out on the cluster: I ran an echo hello script with --ntasks=9 and expected sbatch to echo hello 9 times to stdout (which is collected in slurm-job_id.out), but to my surprise there was a single execution of my echo hello script. Then what does this option even do? It seems to do nothing, or at least I can't see what it is supposed to be doing.
I do know the -a, --array=<indexes> option exists for multiple jobs, but that is a different topic. I simply want to know what --ntasks is supposed to do, ideally with an example so that I can test it out on the cluster.
The --ntasks parameter is useful if you have commands that you want to run in parallel within the same batch script.
This may be two separate commands separated by an & or two commands used in a bash pipe (|).
For example
Using the default ntasks=1
#!/bin/bash
#SBATCH --ntasks=1
srun sleep 10 &
srun sleep 12 &
wait
Will throw the warning:
Job step creation temporarily disabled, retrying
The number of tasks was specified as one by default, and therefore the second task cannot start until the first task has finished.
This job will finish in around 22 seconds. To break this down:
sacct -j515058 --format=JobID,Start,End,Elapsed,NCPUS
JobID Start End Elapsed NCPUS
------------ ------------------- ------------------- ---------- ----------
515058 2018-12-13T20:51:44 2018-12-13T20:52:06 00:00:22 1
515058.batch 2018-12-13T20:51:44 2018-12-13T20:52:06 00:00:22 1
515058.0 2018-12-13T20:51:44 2018-12-13T20:51:56 00:00:12 1
515058.1 2018-12-13T20:51:56 2018-12-13T20:52:06 00:00:10 1
Here task 0 started and finished (in 12 seconds), followed by task 1 (in 10 seconds), for a total elapsed time of 22 seconds.
To run both of these commands simultaneously:
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 sleep 10 &
srun --ntasks=1 sleep 12 &
wait
Running the same sacct command as specified above
sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS
JobID Start End Elapsed NCPUS
------------ ------------------- ------------------- ---------- ----------
515064 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 2
515064.batch 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 2
515064.0 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 1
515064.1 2018-12-13T21:34:08 2018-12-13T21:34:18 00:00:10 1
Here the total job takes 12 seconds. There is no risk of tasks waiting for resources, as the number of tasks has been specified in the batch script, so the job has the resources to run this many commands at once.
Each job step inherits the parameters specified for the batch script. This is why --ntasks=1 needs to be specified for each srun command; otherwise each step would use --ntasks=2 and the second command would not run until the first step had finished.
Another caveat of the tasks inheriting the batch parameters is if --export=NONE is specified as a batch parameter. In this case --export=ALL should be specified for each srun command otherwise environment variables set within the sbatch script are not inherited by the srun command.
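A minimal sketch of that caveat, assuming a hypothetical variable MY_SETTING set inside the batch script:
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --export=NONE          # batch script does not inherit the submission environment
export MY_SETTING=42           # hypothetical variable set within the batch script
# Because the job was submitted with --export=NONE, each srun needs --export=ALL,
# otherwise MY_SETTING would not be visible inside the job step.
srun --ntasks=1 --export=ALL bash -c 'echo "MY_SETTING=$MY_SETTING"' &
srun --ntasks=1 --export=ALL bash -c 'echo "MY_SETTING=$MY_SETTING"' &
wait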
Additional notes:
When using bash pipes, it may be necessary to specify --nodes=1 to prevent the commands on either side of the pipe from running on separate nodes.
When using & to run commands simultaneously, the wait is vital. Without it, the batch script would exit before the srun steps finish, and the still-running task would be cancelled even if the other task completed successfully.
The --ntasks option specifies how many instances of your command are executed.
For a common cluster setup, and if you start your command with srun, this corresponds to the number of MPI ranks.
In contrast, the --cpus-per-task option specifies how many CPUs each task can use.
Your output surprises me as well. Did you launch your command directly in the script or via srun?
Does your script look like this:
#!/bin/bash
#SBATCH --ntasks=8
## more options
echo hello
This should always output only a single line, because the script itself is executed only once (on the first node of the allocation), not once per task.
If your script looks like
#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello
srun launches your command as a job step on the allocated tasks, and as a result you should get 8 lines of hello.
Tasks are processes that a job executes in parallel on one or more nodes. sbatch allocates resources for your job, but even if you request resources for multiple tasks, it will launch your job script in a single process on a single node only. srun is used to launch job steps from the batch script. --ntasks=N instructs srun to execute N copies of the job step.
For example,
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
means that you want to run two processes in parallel, and have each process access two CPUs. sbatch will allocate four CPUs for your job and then start the batch script in a single process. Within your batch script, you can create a parallel job step using
srun --ntasks=2 --cpus-per-task=2 step.sh
This will run two processes in parallel, both of them executing the step.sh script. From the same job, you could also run
srun --ntasks=1 --cpus-per-task=4 step.sh
This would launch a single process that can access all four CPUs (although it would issue a warning).
It's worth noting that within the allocated resources, your job script is free to do anything, and it doesn't have to use srun to create job steps (but you need srun to launch a job step in multiple nodes). For example, the following script will run both steps in parallel:
#!/bin/bash
#SBATCH --ntasks=1
step1.sh &
step2.sh &
wait
If you want to launch job steps using srun and have two different steps run in parallel, then your job needs to allocate two tasks, and your job steps need to request only one task. You also need to provide the --exclusive argument to srun, for the job steps to use separate resources.
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 --exclusive step1.sh &
srun --ntasks=1 --exclusive step2.sh &
wait
