Multiple mpirun executions at the same time? - bash

I have a list of mpirun commands serially in a bash script. I can run the bash script that causes the individual mpirun commands to be executed serially (but parallel within themselves). Now, each of these mpirun commands require only a fraction of total computational resources the system has. That means, when the first mpirun command executes, only few CPUs are working, rest should be idle. Is there a way to again 'channelize' each mpirun commands of the bash script into different sets of CPUs so that almost all the computing resources get used efficiently? (Extra: Each mpirun command executes a common python code but with different set of arguments.)

Related

SLURM: Embarrassingly parallel program inside an embarrassingly parallel program

I have a complex model written in Matlab. The model was not written by us and is best thought of as a "black box" i.e. in order to fix the relevant problems from the inside would require rewritting the entire model which would take years.
If I have an "embarrassingly parallel" problem I can use an array to submit X variations of the same simulation with the option #SBATCH --array=1-X. However, clusters normally have a (frustratingly small) limit on the maximum array size.
Whilst using a PBS/TORQUE cluster I have got around this problem by forcing Matlab to run on a single thread, requesting multiple CPUs and then running multiple instances of Matlab in the background. An example submission script is:
#!/bin/bash
<OTHER PBS COMMANDS>
#PBS -l nodes=1:ppn=5,walltime=30:00:00
#PBS -t 1-600
<GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON ARRAY NUMBER>
# define Matlab options
options="-nodesktop -noFigureWindows -nosplash -singleCompThread"
for sub_job in {1..5}
do
<GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON LOOP NUMBER (i.e. sub_job)>
matlab ${options} -r "run_model(${arg1}, ${arg2}, ..., ${argN}); exit" &
done
wait
<TIDY UP AND FINISH COMMANDS>
Can anyone help me do the equivalent on a SLURM cluster?
The par function will not run my model in a parallel loop in Matlab.
The PBS/TORQUE language was very intuitive but SLURM's is confusing me. Assuming a similarly structured submission script as my PBS example, here is what I think certain commands will result in.
--ncpus-per-task=5 seems like the most obvious one to me. Would I put srun in front of the matlab command in the loop or leave it as it is in the PBS script loop?
--ntasks=5 I would imagine would request 5 CPUs but will run in serial unless a program specifically requests them (i.e. MPI or Python-Multithreaded etc). Would I need to put srun in front of the Matlab command in this case?
I am not a big expert on array jobs but I can help you with the inner loop.
I would always use GNU parallel to run several serial processes in parallel, within a single job that has more than one CPU available. It is a simple perl script, so not difficult to 'install', and its syntax is extremely easy. What it basically does is to run some (nested) loop in parallel. Each iteration of this loop contains a (long) process, like your Matlab command. In contrast to your solution it does not submit all these processes at once, but it runs only N processes at the same time (where N is the number of CPUs you have available). As soon as one finishes, the next one is submitted, and so on until your entire loop is finished. It is perfectly fine that not all processes take the same amount of time, as soon as one CPU is freed, another process is started.
Then, what you would like to do is to launch 600 jobs (for which I substitute 3 below, to show the complete behavior), each with 5 CPUs. To do that you could do the following (whereby I have not included the actual run of matlab, but that trivially can be included):
#!/bin/bash
#SBATCH --job-name example
#SBATCH --out job.slurm.out
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 5
#SBATCH --mem 512
#SBATCH --time 30:00:00
#SBATCH --array 1-3
cmd="echo matlab array=${SLURM_ARRAY_TASK_ID}"
parallel --max-procs=${SLURM_CPUS_PER_TASK} "$cmd,subjob={1}; sleep 30" ::: {1..5}
Submitting this job using:
$ sbatch job.slurm
submits 3 jobs to the queue. For example:
$ squeue | grep tdegeus
3395882_1 debug example tdegeus R 0:01 1 c07
3395882_2 debug example tdegeus R 0:01 1 c07
3395882_3 debug example tdegeus R 0:01 1 c07
Each job gets 5 CPUs. These are exploited by the parallel command, to run your inner loop in parallel. Once again, the range of this inner loop may be (much) larger than 5, parallel takes care of the balancing between the 5 available CPUs within this job.
Let's inspect the output:
$ cat job.slurm.out
matlab array=2,subjob=1
matlab array=2,subjob=2
matlab array=2,subjob=3
matlab array=2,subjob=4
matlab array=2,subjob=5
matlab array=1,subjob=1
matlab array=3,subjob=1
matlab array=1,subjob=2
matlab array=1,subjob=3
matlab array=1,subjob=4
matlab array=3,subjob=2
matlab array=3,subjob=3
matlab array=1,subjob=5
matlab array=3,subjob=4
matlab array=3,subjob=5
You can clearly see the 3 times 5 processes run at the same time now (as their output is mixed).
No need in this case to use srun. SLURM will create 3 jobs. Within each job everything happens on individual compute nodes (i.e. as if you were running on your own system).
Installing GNU Parallel - option 1
To 'install' GNU parallel into your home folder, for example in ~/opt.
Download the latest GNU Parallel.
Make the directory ~/opt if it does not yet exist
mkdir $HOME/opt
'Install' GNU Parallel:
tar jxvf parallel-latest.tar.bz2
cd parallel-XXXXXXXX
./configure --prefix=$HOME/opt
make
make install
Add ~/opt to your path:
export PATH=$HOME/opt/bin:$PATH
(To make it permanent, add that line to your ~/.bashrc.)
Installing GNU Parallel - option 2
Use conda.
(Optional) Create a new environment
conda create --name myenv
Load an existing environment:
conda activate myenv
Install GNU parallel:
conda install -c conda-forge parallel
Note that the command is available only when the environment is loaded.
While Tom's suggestion to use GNU Parallel is a good one, I will attempt to answer the question asked.
If you want to run 5 instances of the matlab command with the same arguments (for example if they were communicating via MPI) then you would want to ask for --ncpus-per-task=1, --ntasks=5 and you should preface your matlab line with srun and get rid of the loop.
In your case, as each of your 5 calls to matlab are independent, you want to ask for --ncpus-per-task=5, --ntasks=1. This will ensure that you allocate 5 CPU cores per job to do with as you wish. You can preface your matlab line with srun if you wish but it will make little difference you are only running one task.
Of course, this is only efficient if each of your 5 matlab runs take the same amount of time since if one takes much longer then the other 4 CPU cores will be sitting idle, waiting for the fifth to finish.
You can do it with python and subprocess, in what I describe below you just set the number of nodes and tasks and that is it, no need for an array, no need to match the size of the array to the number of simulations, etc... It will just execute python code until it is done, more nodes faster execution.
Also, it is easier to decide on variables as everything is being prepared in python (which is easier than bash).
It does assume that the Matlab scripts save the output to file - nothing is returned by this function (it can be changed..)
In the sbatch script you need to add something like this:
#!/bin/bash
#SBATCH --output=out_cluster.log
#SBATCH --error=err_cluster.log
#SBATCH --time=8:00:00
#SBATCH --nodes=36
#SBATCH --exclusive
#SBATCH --cpus-per-task=2
export IPYTHONDIR="`pwd`/.ipython"
export IPYTHON_PROFILE=ipyparallel.${SLURM_JOBID}
whereis ipcontroller
sleep 3
echo "===== Beginning ipcontroller execution ======"
ipcontroller --init --ip='*' --nodb --profile=${IPYTHON_PROFILE} --ping=30000 & # --sqlitedb
echo "===== Finish ipcontroller execution ======"
sleep 15
srun ipengine --profile=${IPYTHON_PROFILE} --timeout=300 &
sleep 75
echo "===== Beginning python execution ======"
python run_simulations.py
depending on your system, read more here:https://ipyparallel.readthedocs.io/en/latest/process.html
and run_simulations.py should contain something like this:
import os
from ipyparallel import Client
import sys
from tqdm import tqdm
import subprocess
from subprocess import PIPE
def run_sim(x):
import os
import subprocess
from subprocess import PIPE
# send job!
params = [str(i) for i in x]
p1 = subprocess.Popen(['matlab','-r',f'"run_model({x[0]},{x[1]})"'], env=dict(**os.environ))
p1.wait()
return
##load ipython parallel
rc = Client(profile=os.getenv('IPYTHON_PROFILE'))
print('Using ipyparallel with %d engines', len(rc))
lview = rc.load_balanced_view()
view = rc[:]
print('Using ipyparallel with %d engines', len(rc))
sys.stdout.flush()
map_function = lview.map_sync
to_send = []
#prepare variables <-- here you should prepare the arguments for matlab
####################
for param_1 in [1,2,3,4]:
for param_2 in [10,20,40]:
to_send.append([param_1, param_2])
ind_raw_features = lview.map_async(run_sim,to_send)
all_results = []
print('Sending jobs');sys.stdout.flush()
for i in tqdm(ind_raw_features,file=sys.stdout):
all_results.append(i)
You also get a progress bar in the stdout, which is nice... you can also easily add a check to see if the output files exist and ignore a run.

Bash script to identify which processes are launched on system start-up

Need a script in Bash that identifies which are the processes lunched when system start-up and then:
Print them in the order in which they were launched;
Print them by ordering them according to the CPU consumption they have made.
What could be the solution?
I tried with the commands
ps -e -opid, lstart, cmd,% cpu
and various combinations, but nothing.

Julia Command Line Running Processes in Parallel

I have a Julia script that converts csvs to a binary format. Trust me it's great. I also have many (seemingly innumerable) csvs that I want to process. It's a shared network and so I can only process five files at a clip without savagely burdening the CPU and making my coworkers irate and potentially unstable. Accordingly, I want to run the script in groups of five, wait for them to finish, and then run the next batch as background processes until it's Miller time all using Julia's wonderful run() function ala:
julia csvparse3.jl /home/file1.csv > /dev/null 2>&1 &
I'm fairly certain that I could sidestep all of this by using addprocs() and pmap() if I made my parsing script into a Julia module/function. However, the reason I'm asking this is because I don't know what I would then do if my original script was written in Fortran or even worse Python? Is there a way for me to achieve my aforementioned goals for an arbitrary number of external programs, ascertain when the processes are finished, and start anew in the context of a simple loop? Many thanks.
With GNU Parallel you can run:
parallel -j5 julia csvparse3.jl ::: /home/*.csv > /dev/null 2>&1
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

running multiple instances of a single MPI executable in parallel on a cluster using mpirun options?

I'm trying to write a shell script to perform some kind of algorithm, and a part of it requires parallel execution of an MPI executable across multiple input files on a grid engine cluster. From what I read, it seems like mpirun supports MPMD execution by using the colon sign or using the application context/schema file and then perform mpirun --app my_appfile. And below is what my my_appfile looks like,
-np 12 /path/to/executable /path/to/dir1/input1
-np 12 /path/to/executable /path/to/dir2/input2
-np 12 /path/to/executable /path/to/dir3/input3
...
-np 12 /path/to/executable /path/to/dir10/input10
I was trying to parallely execute 10 instances of the same executable and assign the resources in the cluster accordingly (120 processes in this case in SGE's orte parallel environment).
However, there was a problem. Each input file was written to generate an output in the same directory as each particular input file. As I submitted the job (the submission script contains only the mpirun --app my_appfile line), it shows only the output from input1 in dir1, but not the rest. So I wonder what is the problem here. Is it the problem with mpirun options or the problem with how the cluster does the job? Any help would be highly appreciated. Thank you!

Making qsub block until job is done?

Currently, I have a driver program that runs several thousand instances of a "payload" program and does some post-processing of the output. The driver currently calls the payload program directly, using a shell() function, from multiple threads. The shell() function executes a command in the current working directory, blocks until the command is finished running, and returns the data that was sent to stdout by the command. This works well on a single multicore machine. I want to modify the driver to submit qsub jobs to a large compute cluster instead, for more parallelism.
Is there a way to make the qsub command output its results to stdout instead of a file and block until the job is finished? Basically, I want it to act as much like "normal" execution of a command as possible, so that I can parallelize to the cluster with as little modification of my driver program as possible.
Edit: I thought all the grid engines were pretty much standardized. If they're not and it matters, I'm using Torque.
You don't mention what queuing system you're using, but SGE supports the '-sync y' option to qsub which will cause it to block until the job completes or exits.
In TORQUE this is done using the -x and -I options. qsub -I specifies that it should be interactive and -x says run only the command specified. For example:
qsub -I -x myscript.sh
will not return until myscript.sh finishes execution.
In PBS you can use qsub -Wblock=true <command>

Resources