Running a bash script on nodes srun uses for an mpi job - bash

I can launch an mpi job across multiple compute nodes using a slurm batch script and srun. As part of the slurm script, I want to launch a shell script that runs on the nodes the job is using to collect information (using the top command) about the job tasks running on that node. I want the shell script to run at the node level, rather than the task level. The shell script works fine running on just a single compute node, and for jobs using a single compute node I can run it in the background as part of the slurm script. But it's not clear how to get it to run on multiple compute nodes using srun. I've tried using multiple srun commands in the slurm batch script, but the shell script only starts on one compute node.

I figured this out. I create a shell script wrapper to invoke the mpi code and then in the slurm script I use srun on the wrapper script. In the wrapper script I have the following conditional to invoke my shell script (sampleTop2.sh) to run one instance on each of the allocated compute nodes.
if (( ( SLURM_PROCID % SLURM_NTASKS_PER_NODE) == 0 ))
then
./sampleTop2.sh $USER $SLURMD_NODENAME 10 &
fi
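As a minimal sketch of how the conditional above might sit inside such a wrapper, the check can be pulled into a small function that can also be tried outside a Slurm job. The function name is_first_task_on_node and the fallback defaults are illustrative and not part of the original script. The sketch assumes tasks are placed on nodes in blocks and that SLURM_NTASKS_PER_NODE is set, which typically requires passing --ntasks-per-node:

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the given task rank is the first one
# on its node, assuming ranks are assigned to nodes in contiguous blocks.
is_first_task_on_node() {
    local procid=$1 tasks_per_node=$2
    [ $(( procid % tasks_per_node )) -eq 0 ]
}

# Slurm sets these inside a job step; the defaults are assumptions that
# let the sketch run outside a job.
procid=${SLURM_PROCID:-0}
tasks_per_node=${SLURM_NTASKS_PER_NODE:-1}

if is_first_task_on_node "$procid" "$tasks_per_node"; then
    # In the real wrapper this is where ./sampleTop2.sh would be started
    # in the background, before launching the MPI program.
    echo "task $procid: would start the sampler on ${SLURMD_NODENAME:-unknown}"
fi
```

If the task layout is not block-wise, testing whether SLURM_LOCALID (the node-local task ID) equals 0 may be a more robust way to pick one task per node.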

Related

What does the --ntasks or -n option do in SLURM?

I was using SLURM on a computing cluster and came across the --ntasks or -n option. I have read the documentation for it (http://slurm.schedmd.com/sbatch.html):
sbatch does not launch tasks, it requests an allocation of resources
and submits a batch script. This option advises the Slurm controller
that job steps run within the allocation will launch a maximum of
number tasks and to provide for sufficient resources. The default is
one task per node, but note that the --cpus-per-task option will
change this default.
The specific part I do not understand is:
run within the allocation will launch a maximum of number tasks and to
provide for sufficient resources.
I have a few questions:
I guess my first question is what the word "task" means and how it differs from the word "job" in the SLURM context. I usually think of a job as running the bash script under sbatch, as in sbatch my_batch_job.sh. I'm not sure what task means.
If I equate the word task with job, then I would have expected it to run the same bash script multiple times according to the argument to -n, --ntasks=<number>. However, I tested it on the cluster: I ran an echo hello with --ntasks=9 and expected sbatch to echo hello 9 times to stdout (which is collected in slurm-job_id.out), but to my surprise there was a single execution of my echo hello script. Then what does this option even do? It seems to do nothing, or at least I can't see what it's supposed to be doing.
I do know the -a, --array=<indexes> option exists for multiple jobs. That is a different topic. I simply want to know what --ntasks is supposed to do, ideally with an example so that I can test it out on the cluster.
The --ntasks parameter is useful if you have commands that you want to run in parallel within the same batch script.
This may be two separate commands separated by an & or two commands used in a bash pipe (|).
For example
Using the default ntasks=1
#!/bin/bash
#SBATCH --ntasks=1
srun sleep 10 &
srun sleep 12 &
wait
Will throw the warning:
Job step creation temporarily disabled, retrying
The number of tasks was left at one (the default), and therefore the second job step cannot start until the first has finished.
This job will finish in around 22 seconds. To break this down:
sacct -j515058 --format=JobID,Start,End,Elapsed,NCPUS
JobID Start End Elapsed NCPUS
------------ ------------------- ------------------- ---------- ----------
515058 2018-12-13T20:51:44 2018-12-13T20:52:06 00:00:22 1
515058.batch 2018-12-13T20:51:44 2018-12-13T20:52:06 00:00:22 1
515058.0 2018-12-13T20:51:44 2018-12-13T20:51:56 00:00:12 1
515058.1 2018-12-13T20:51:56 2018-12-13T20:52:06 00:00:10 1
Here step 0 started and finished (in 12 seconds), followed by step 1 (in 10 seconds), for a total elapsed time of 22 seconds.
To run both of these commands simultaneously:
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 sleep 10 &
srun --ntasks=1 sleep 12 &
wait
Running the same sacct command as specified above
sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS
JobID Start End Elapsed NCPUS
------------ ------------------- ------------------- ---------- ----------
515064 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 2
515064.batch 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 2
515064.0 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 1
515064.1 2018-12-13T21:34:08 2018-12-13T21:34:18 00:00:10 1
Here the whole job takes 12 seconds. There is no risk of job steps waiting for resources, because the number of tasks has been specified in the batch script and the job therefore has the resources to run this many commands at once.
Each task inherits the parameters specified for the batch script. This is why --ntasks=1 needs to be specified for each srun task, otherwise each task uses --ntasks=2 and so the second command will not run until the first task has finished.
Another caveat of the tasks inheriting the batch parameters is if --export=NONE is specified as a batch parameter. In this case --export=ALL should be specified for each srun command otherwise environment variables set within the sbatch script are not inherited by the srun command.
Additional notes:
When using bash pipes, it may be necessary to specify --nodes=1 to prevent commands either side of the pipes running on separate nodes.
When using & to run commands simultaneously, the wait is vital. In this case, without the wait command, task 0 would cancel itself, given task 1 completed successfully.
The "--ntasks" option specifies how many instances of your command are executed.
In a common cluster setup, if you start your command with "srun", this corresponds to the number of MPI ranks.
In contrast, the option "--cpus-per-task" specifies how many CPUs each task can use.
Your output surprises me as well. Have you launched your command in the script or via srun?
Does your script look like:
#!/bin/bash
#SBATCH --ntasks=8
## more options
echo hello
This should always output only a single line, because the batch script itself is executed only once, on the first node of the allocation, and not once per task.
If your script looks like
#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello
srun causes the script to run your command on the worker nodes and as a result you should get 8 lines of hello.
Tasks are processes that a job executes in parallel in one or more nodes. sbatch allocates resources for your job, but even if you request resources for multiple tasks, it will launch your job script in a single process in a single node only. srun is used to launch job steps from the batch script. --ntasks=N instructs srun to execute N copies of the job step.
For example,
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
means that you want to run two processes in parallel, and have each process access two CPUs. sbatch will allocate four CPUs for your job and then start the batch script in a single process. Within your batch script, you can create a parallel job step using
srun --ntasks=2 --cpus-per-task=2 step.sh
This will run two processes in parallel, both of them executing the step.sh script. From the same job, you could also run
srun --ntasks=1 --cpus-per-task=4 step.sh
This would launch a single process that can access all four CPUs (although it would issue a warning).
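The arithmetic behind the example above can be sketched in a few lines of shell. This is only an illustration: the helper name total_cpus is made up here, and the numbers are hard-coded rather than read from a real allocation.

```shell
#!/bin/bash
# Hypothetical helper: CPUs needed for a given number of tasks, each with
# a given number of CPUs per task.
total_cpus() {
    echo $(( $1 * $2 ))
}

# Values from the #SBATCH lines in the example above.
ntasks=2
cpus_per_task=2
allocated=$(total_cpus "$ntasks" "$cpus_per_task")
echo "allocation: $allocated CPUs"

# A step asking for 1 task with 4 CPUs needs the same total, so it fits.
step_request=$(total_cpus 1 4)
if [ "$step_request" -le "$allocated" ]; then
    echo "a 1 x 4 step fits in the allocation"
fi
```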
It's worth noting that within the allocated resources, your job script is free to do anything, and it doesn't have to use srun to create job steps (but you need srun to launch a job step in multiple nodes). For example, the following script will run both steps in parallel:
#!/bin/bash
#SBATCH --ntasks=1
step1.sh &
step2.sh &
wait
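The background-and-wait pattern in the script above can be tried without a scheduler. In the sketch below the step function is a hypothetical stand-in for step1.sh and step2.sh, and the temporary file is only there to record which steps finished:

```shell
#!/bin/bash
# Temporary file that each step appends to when it completes.
outfile=$(mktemp)

# Hypothetical stand-in for step1.sh / step2.sh: sleep, then report.
step() {
    sleep "$2"
    echo "$1 done" >> "$outfile"
}

# Start both steps in the background so they run concurrently.
step step1 1 &
step step2 2 &

# Without this, the script would reach its end while the steps are still
# running; in a batch job that would end the job early.
wait

finished=$(wc -l < "$outfile" | tr -d ' ')
echo "$finished steps finished"
rm -f "$outfile"
```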
If you want to launch job steps using srun and have two different steps run in parallel, then your job needs to allocate two tasks, and your job steps need to request only one task. You also need to provide the --exclusive argument to srun, for the job steps to use separate resources.
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 --exclusive step1.sh &
srun --ntasks=1 --exclusive step2.sh &
wait

Oozie fork running only 2 forks in parallel

I am running an oozie workflow job which has a fork node. The fork node directs the workflow to 4 different sub-workflows, which in turn call shell scripts.
Ideally all 4 shell scripts are supposed to execute in parallel, but for me only 2 shell scripts are executing in parallel.
Could someone help me to address this issue?

Force shell script to run tasks in sequence

I'm running a shell script that executes several tasks. The problem is that the script does not wait for a task to end before starting the next one. It should instead wait for one task to complete before starting the next. Is there a way to do that? My script looks like this
sbatch retr.sc 19860101 19860630
scp EN/EN1986* myhostname#myhost.it:/storage/myhostname/MetFiles
The first command runs retr.sc, which retrieves files and takes roughly half an hour. The second command, however, runs right away and moves only some of the files to the destination. I want the scp command to run only when the first is complete.
thanks in advance
You have several options:
use srun rather than sbatch: srun retr.sc 19860101 19860630
use sbatch for the second command as well, and make it depend on the first one
like this:
RES=$(sbatch retr.sc 19860101 19860630)
sbatch --dependency=afterok:${RES##* } --wrap "scp EN/EN1986* myhostname#myhost.it:/storage/myhostname/MetFiles"
create one script that incorporates both retr.sc and scp and submit that script.
sbatch exits immediately on submitting the job to slurm.
salloc will wait for the job to finish before exiting.
from the man page:
$ salloc -N16 xterm
salloc: Granted job allocation 65537
(at this point the xterm appears, and salloc waits for xterm to exit)
salloc: Relinquishing job allocation 65537
Thanks for your replies.
I've sorted it out this way:
RES=$(sbatch retr.sc $date1 $date2)
array=(${RES// / })
JOBID=${array[3]}
year1=${date1:0:4}
sbatch --dependency=afterok:${JOBID} scp.sh $year1
where scp.sh is the script for transferring the file to my local machine
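Both answers above rely on pulling the job ID out of the line sbatch prints. A minimal sketch of that step, using a sample output string so it runs without a scheduler (the function name extract_job_id is made up for illustration, and the sample line assumes the usual "Submitted batch job <id>" format):

```shell
#!/bin/bash
# Hypothetical helper: keep only the last word of sbatch's output line.
extract_job_id() {
    local sbatch_output=$1
    echo "${sbatch_output##* }"
}

# Sample line standing in for RES=$(sbatch retr.sc ...).
sample_output="Submitted batch job 515058"
job_id=$(extract_job_id "$sample_output")

# Print the dependent submission instead of running it.
echo "would run: sbatch --dependency=afterok:${job_id} scp.sh"
```

sbatch also has a --parsable option that prints just the job ID (followed by the cluster name, if any), which may avoid the string handling altogether.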

How can I tell if a PBS script was called by bash or qsub

I have a PBS script that processes several environment variables. PBS is a wrapper for bash that sends the bash script to a job scheduling queue. The processed variables form a command to run a scientific application. A PBS script is written in bash with additional information for the job scheduler encoded in the bash comments.
How can I determine programmatically if my script was called by qsub, the command that interprets PBS scripts, or if it was called by bash?
If the script is running under bash I would like to treat the call as a dry run and only print out the command that was generated. In that way it bypasses the job queue entirely.
This may not be completely robust, but one heuristic which may work is to test for the existence of any of the following environmental variables which tend to be defined under qsub, as listed here.
PBS_O_HOST (the name of the host upon which the qsub command is running)
PBS_SERVER (the hostname of the pbs_server which qsub submits the job to)
PBS_O_QUEUE (the name of the original queue to which the job was submitted)
PBS_O_WORKDIR (the absolute path of the current working directory of the qsub command)
PBS_ARRAYID (each member of a job array is assigned a unique identifier)
PBS_ENVIRONMENT (set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job)
PBS_JOBID (the job identifier assigned to the job by the batch system)
PBS_JOBNAME (the job name supplied by the user)
PBS_NODEFILE (the name of the file contain the list of nodes assigned to the job)
PBS_QUEUE (the name of the queue from which the job was executed from)
PBS_WALLTIME (the walltime requested by the user or default walltime allotted by the scheduler)
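A rough sketch of this heuristic, checking just two of the variables listed above. The function name running_under_pbs and the command string are placeholders rather than anything from the original script, and as noted the test is not fully robust, since these variables could also be set by hand:

```shell
#!/bin/bash
# Heuristic: treat the script as running under qsub if either variable is
# set to a non-empty value.
running_under_pbs() {
    [ -n "${PBS_JOBID:-}" ] || [ -n "${PBS_ENVIRONMENT:-}" ]
}

# Hypothetical command assembled from the processed environment variables.
cmd="my_app --input data.in"

if running_under_pbs; then
    mode="submit"
else
    mode="dry-run"
fi

# Only print here; a real script would run $cmd in the "submit" case.
echo "$mode: $cmd"
```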
You can check the parent caller of bash:
CALLER=$(ps -p "$PPID" -o comm=)
if [[ <compare $CALLER with expected process name> ]]; then
<script was called by qsub or something>
fi
Extra note: Bash always has an unexported variable set: $BASH_VERSION so if it's set you'd be sure that the script is running with Bash. The question left would just be about which one called it.
Also, don't run the check inside a subshell () as you probably would get from $PPID the process of same shell, not the caller.
If your script is called with deeper levels in which case $PPID would not be enough, you can always recursively scan the parent pids with ps -p <pid> -o ppid=.

Making qsub block until job is done?

Currently, I have a driver program that runs several thousand instances of a "payload" program and does some post-processing of the output. The driver currently calls the payload program directly, using a shell() function, from multiple threads. The shell() function executes a command in the current working directory, blocks until the command is finished running, and returns the data that was sent to stdout by the command. This works well on a single multicore machine. I want to modify the driver to submit qsub jobs to a large compute cluster instead, for more parallelism.
Is there a way to make the qsub command output its results to stdout instead of a file and block until the job is finished? Basically, I want it to act as much like "normal" execution of a command as possible, so that I can parallelize to the cluster with as little modification of my driver program as possible.
Edit: I thought all the grid engines were pretty much standardized. If they're not and it matters, I'm using Torque.
You don't mention what queuing system you're using, but SGE supports the '-sync y' option to qsub which will cause it to block until the job completes or exits.
In TORQUE this is done using the -x and -I options. qsub -I specifies that it should be interactive and -x says run only the command specified. For example:
qsub -I -x myscript.sh
will not return until myscript.sh finishes execution.
In PBS you can use qsub -Wblock=true <command>
