Running multiple instances of a single MPI executable in parallel on a cluster using mpirun options?

I'm trying to write a shell script that implements an algorithm, and part of it requires running an MPI executable in parallel across multiple input files on a Grid Engine cluster. From what I have read, mpirun supports MPMD execution either by separating the commands with colons or by listing them in an application context (appfile) and running mpirun --app my_appfile. Below is what my_appfile looks like:
-np 12 /path/to/executable /path/to/dir1/input1
-np 12 /path/to/executable /path/to/dir2/input2
-np 12 /path/to/executable /path/to/dir3/input3
-np 12 /path/to/executable /path/to/dir10/input10
I was trying to run 10 instances of the same executable in parallel and assign the cluster resources accordingly (120 processes in this case, in SGE's orte parallel environment).
However, there was a problem. Each input file is set up to generate its output in the same directory as the input file itself. When I submitted the job (the submission script contains only the mpirun --app my_appfile line), only the output from input1 in dir1 appeared, not the rest. So I wonder what the problem is. Is it a problem with the mpirun options or with how the cluster runs the job? Any help would be highly appreciated. Thank you!
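For reference, an appfile in the format shown above can be generated instead of typed by hand. This is only a minimal sketch: make_appfile is a hypothetical helper, and the paths in the usage line are placeholders, not the asker's real layout. One caveat that may be relevant here, as far as I know: Open MPI launches all lines of an appfile as a single MPI job sharing one MPI_COMM_WORLD, so ten lines are not necessarily ten independent runs.

```shell
#!/usr/bin/env bash
# make_appfile NP EXECUTABLE INPUT...
# Hypothetical helper: prints one "-np NP EXECUTABLE INPUT" line per input,
# matching the appfile layout shown in the question.
make_appfile() {
    local np=$1 exe=$2 input
    shift 2
    for input in "$@"; do
        # "--" stops printf from treating the leading "-np" as an option.
        printf -- '-np %s %s %s\n' "$np" "$exe" "$input"
    done
}
```

Possible usage, with placeholder paths: make_appfile 12 /path/to/executable /path/to/dir1/input1 /path/to/dir2/input2 > my_appfile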


How to utilise GNU parallel efficiently?

I have a script say, whose contents are 10 different python calls shown below:
Now, I use GNU parallel
nohup parallel -j 5 < &
It starts as expected: 5 processors are in use and the first 5 scripts are running. I then notice that some of them (say, two) complete very fast, whereas the others need more time.
Now there are unused resources (2 processors) while waiting for the remaining 3 scripts to complete so that the next 5 can be loaded. Is there a way to use these resources by loading new commands as the existing ones complete?
For information: My OS is CentOS
As @RenaudPacalet says, there is nothing else to do: GNU Parallel already starts a new job as soon as a running one finishes.
So there is something in your scripts that prevents this from happening.
To help debug you can use:
parallel --lb --tag <
and maybe add a "Starting X" line at the beginning of each script and a "Finishing X" line at the end, so you can see that the scripts are indeed finishing.
Without knowing anything about the scripts, it is impossible to say what is causing this.
(Instead of nohup, consider using tmux or screen, so the jobs run in the background but you can always check in on them and see their output. nohup is not ideal for debugging.)
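The "Starting X" / "Finishing X" suggestion can also be applied without editing each script, by wrapping the command. The following is a minimal sketch, assuming bash; run_logged is a hypothetical name and not part of GNU Parallel.

```shell
#!/usr/bin/env bash
# run_logged NAME COMMAND [ARGS...]
# Hypothetical wrapper: prints a start marker, runs the command,
# prints a finish marker with the exit status, and returns that status.
run_logged() {
    local name=$1 status
    shift
    echo "Starting $name"
    "$@"
    status=$?
    echo "Finishing $name (exit $status)"
    return "$status"
}
```

Each command in the job list would then be prefixed with the wrapper. Note that GNU Parallel starts every job in a fresh shell, so a bash function has to be exported first with export -f run_logged.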

Multiple mpirun executions at the same time?

I have a list of mpirun commands in a bash script. Running the script executes the individual mpirun commands serially (each one is parallel within itself). Each of these mpirun commands requires only a fraction of the system's total computational resources, so while the first mpirun command executes, only a few CPUs are working and the rest sit idle. Is there a way to channel the mpirun commands of the bash script onto different sets of CPUs so that almost all of the computing resources are used efficiently? (Extra: each mpirun command executes the same Python code but with a different set of arguments.)
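One common shell-level approach is to start each command in the background with & and then wait for all of them. The following is a minimal sketch under stated assumptions: it needs bash, run_all_in_background is a hypothetical helper, and the command file holds one complete command per line. Whether concurrent mpirun instances end up on distinct cores depends on the MPI implementation's process-binding settings, which this sketch does not address.

```shell
#!/usr/bin/env bash
# run_all_in_background CMD_FILE
# Hypothetical helper: starts every line of CMD_FILE as a background job,
# waits for all of them, and returns the number of commands that failed.
run_all_in_background() {
    local cmd_file=$1 line pid failures=0
    local pids=()
    while IFS= read -r line; do
        [ -z "$line" ] && continue          # skip blank lines
        # Detach stdin so a job cannot consume the rest of CMD_FILE.
        bash -c "$line" < /dev/null &
        pids+=("$!")
    done < "$cmd_file"
    for pid in "${pids[@]}"; do
        wait "$pid" || failures=$((failures + 1))
    done
    return "$failures"
}
```

This starts everything at once, so it only makes sense when the commands together fit within the available CPUs.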

Julia Command Line Running Processes in Parallel

I have a Julia script that converts csvs to a binary format. Trust me it's great. I also have many (seemingly innumerable) csvs that I want to process. It's a shared network and so I can only process five files at a clip without savagely burdening the CPU and making my coworkers irate and potentially unstable. Accordingly, I want to run the script in groups of five, wait for them to finish, and then run the next batch as background processes until it's Miller time all using Julia's wonderful run() function ala:
julia csvparse3.jl /home/file1.csv > /dev/null 2>&1 &
I'm fairly certain that I could sidestep all of this by using addprocs() and pmap() if I made my parsing script into a Julia module/function. However, the reason I'm asking this is because I don't know what I would then do if my original script was written in Fortran or even worse Python? Is there a way for me to achieve my aforementioned goals for an arbitrary number of external programs, ascertain when the processes are finished, and start anew in the context of a simple loop? Many thanks.
With GNU Parallel you can run:
parallel -j5 julia csvparse3.jl ::: /home/*.csv > /dev/null 2>&1
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. The drawback is that a CPU sits idle once its own 8 jobs are done, even if other CPUs still have jobs queued.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
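The same "start a new job as soon as one finishes" behaviour can be sketched without GNU Parallel by using xargs -P, which is an extension found in GNU and BSD xargs rather than part of POSIX. This is a swapped-in technique for illustration only; run_pool is a hypothetical name.

```shell
#!/usr/bin/env bash
# run_pool MAX_JOBS COMMAND [ARGS...]
# Hypothetical helper: reads one input per line on stdin and runs
# "COMMAND ARGS... INPUT" for each, with at most MAX_JOBS running at once.
# A new job starts as soon as any running one exits.
run_pool() {
    local max_jobs=$1
    shift
    # Convert newlines to NULs so inputs containing spaces stay intact.
    tr '\n' '\0' | xargs -0 -n 1 -P "$max_jobs" "$@"
}
```

Possible usage for the case in the question: printf '%s\n' /home/*.csv | run_pool 5 julia csvparse3.jl > /dev/null 2>&1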
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - || curl || fetch -o - | bash

MPI not using all CPUs allocated

I am trying to run some code across multiple CPUs using MPI.
I run using:
$ mpirun -np 24 python
I'm running on a cluster with 8 nodes, each with 12 CPUs. My 24 processes get scattered across all nodes.
Let's call the nodes node1, node2, ..., node8 and assume that the master process is on node1 and my job is the only one running. So node1 has the master process and a few slave processes, the rest of the nodes have only slave processes.
Only the node with the master process (i.e. node1) is being used. I can tell because node2 through node8 have load ~0 and node1 has load ~24 (whereas I would expect the load on each node to be approximately equal to the number of CPUs allocated to my job from that node). Also, each time a function is evaluated, I get it to print out the name of the host on which it's running, and it prints out "node1" every time. I don't know whether the master process is the only one doing anything or if the slave processes on the same node as the master are also being used.
The cluster I'm running on was recently upgraded. Before the upgrade, I was using the same code and it behaved entirely as expected (i.e. when I asked for 24 CPUs, it gave me 24 CPUs and then used all 24 CPUs). This problem has only arisen since the upgrade, so I assume a setting somewhere got changed or reset. Has anyone seen this problem before and know how I might fix it?
Edit: This is submitted as a job to a scheduler using:
#$ -cwd
#$ -pe * 24
#$ -o $JOB_ID.out
#$ -e $JOB_ID.err
#$ -r no
#$ -m n
#$ -l h_rt=24:00:00
echo job_id $JOB_ID
echo hostname $HOSTNAME
mpirun -np $NSLOTS python
The cluster is running SGE and I submit this job using:
qsub myjob
It's also possible to specify where you want your jobs to run by using a hostfile. How the hostfile is formatted and used varies by MPI implementation so you'll need to consult the documentation for the one you have installed (man mpiexec) to find out how to use it.
The basic idea is that inside that file, you can define the nodes that you want to use and how many ranks you want on each of those nodes. This may require using other flags to specify how the processes are mapped to your nodes, but in the end, you can usually control how everything is laid out yourself.
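As an illustration of the hostfile idea for the SGE setup in the question, here is a minimal sketch that converts the scheduler's allocation into a hostfile. It rests on two assumptions: that the MPI in use is Open MPI, whose hostfile lines look like "host slots=N", and that SGE's $PE_HOSTFILE lists the host name and slot count in its first two columns. pe_to_hostfile is a hypothetical helper. An Open MPI build with SGE support normally picks up the allocation on its own, so this is mainly useful for diagnosis.

```shell
#!/usr/bin/env bash
# pe_to_hostfile PE_HOSTFILE
# Hypothetical helper: converts SGE allocation lines of the form
#   "host slots queue processor-range"
# into Open MPI style hostfile lines "host slots=N" (assumed format).
pe_to_hostfile() {
    awk 'NF >= 2 { print $1 " slots=" $2 }' "$1"
}
```

Inside the job script this could be used as pe_to_hostfile "$PE_HOSTFILE" > hosts.txt, followed by mpirun --hostfile hosts.txt -np "$NSLOTS" hostname, which prints one host name per rank and shows where the ranks actually land.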
All of this is different if you're using a scheduler like PBS, TORQUE, LoadLeveler, etc. as those can sometimes do some of this for you or have different ways of mapping jobs themselves. You'll have to consult the documentation for those separately or ask another question about them with the appropriate tags here.
Clusters usually have a batch scheduler like PBS, TORQUE, LoadLeveler, etc. These are generally given a shell script that contains your mpirun command along with environment variables that the scheduler needs. You should ask the administrator of your cluster what the process is for submitting batch MPI jobs.

Making qsub block until job is done?

Currently, I have a driver program that runs several thousand instances of a "payload" program and does some post-processing of the output. The driver currently calls the payload program directly, using a shell() function, from multiple threads. The shell() function executes a command in the current working directory, blocks until the command is finished running, and returns the data that was sent to stdout by the command. This works well on a single multicore machine. I want to modify the driver to submit qsub jobs to a large compute cluster instead, for more parallelism.
Is there a way to make the qsub command output its results to stdout instead of a file and block until the job is finished? Basically, I want it to act as much like "normal" execution of a command as possible, so that I can parallelize to the cluster with as little modification of my driver program as possible.
Edit: I thought all the grid engines were pretty much standardized. If they're not and it matters, I'm using Torque.
You don't mention what queuing system you're using, but SGE supports the '-sync y' option to qsub which will cause it to block until the job completes or exits.
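To make the SGE variant behave like the shell() call described in the question, the blocking submit can be combined with an output file that is printed afterwards. This is a minimal sketch for SGE only (the asker's TORQUE needs different options): run_on_cluster and the QSUB override variable are hypothetical, and the output directory is assumed to be on a filesystem shared with the compute nodes.

```shell
#!/usr/bin/env bash
# run_on_cluster SCRIPT [OUT_DIR]
# Hypothetical wrapper around SGE's qsub: blocks until the job finishes,
# prints the job's output on stdout, and returns qsub's exit status.
# Set QSUB to substitute another submit command (used here for testing).
run_on_cluster() {
    local script=$1 out_dir=${2:-$PWD} out status
    out=$(mktemp "$out_dir/qsub_out.XXXXXX") || return 1
    # -sync y : wait for the job to finish
    # -j y    : merge the job's stderr into its stdout
    # -o FILE : write the job's output to FILE
    "${QSUB:-qsub}" -sync y -j y -o "$out" "$script" > /dev/null
    status=$?
    cat "$out"
    rm -f "$out"
    return "$status"
}
```

The driver could then call run_on_cluster payload.sh in place of running the payload directly; payload.sh is a placeholder name.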
In TORQUE this is done using the -x and -I options. qsub -I specifies that it should be interactive and -x says run only the command specified. For example:
qsub -I -x
will not return until the command finishes execution.
In PBS you can use qsub -Wblock=true <command>
