OpenMPI process is killed even using nohup - parallel-processing

I'm running my program using nohup and OpenMPI:
nohup mpirun -np 48 -machinefile temp ./myProgram &
and after some hours I get this error:
--------------------------------------------------------------------------
mpirun noticed that process rank 18 with PID 5445 on node fenix2 exited on signal 1 (Hangup).
--------------------------------------------------------------------------
It occurs on random nodes at random times. I'm using the same seed for all runs. If I run the same application on Windows with Microsoft HPC MPI, it works fine.
Is there something I could do to avoid this error?
Thanks!

Related

nvidia-smi monitor only while a specific process is running

In bash, the nvidia-smi command gives you information about the GPU.
There is also an option to get this periodically, such as nvidia-smi -lms 50.
I want to get this info only as long as a particular process is running.
Pseudocode
nvidia-smi -lms 50 & > logfile.txt
(time ./process1) > timelog.txt
while process1 is running:
keep nvidia-smi running
kill nvidia-smi
How can I do this in bash, cleanly, such that once my bash script exits no process that starts here is left behind for me to clean?
A direct nvidia-smi based solution would be preferred to a bash based one, but the latter is also perfectly fine.
Run both in the background, then wait for the one your job depends on.
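# Start the sampler in the background and remember its PID.
nvidia-smi -lms 50 > logfile.txt &
nvpid=$!
# Kill the sampler even if the script exits early.
trap 'kill "$nvpid" 2>/dev/null' EXIT
# time reports on stderr, so redirect the whole group to capture it.
{ time ./process1; } > timelog.txt 2>&1 &
prpid=$!
wait "$prpid"
kill "$nvpid"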
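The same start, wait, kill pattern can be wrapped in a reusable function. The following is only a sketch under stated assumptions: run_with_monitor and LAST_MONITOR_PID are hypothetical names invented for this illustration, and the monitor is passed as a simple command string so the function can be tried with any long-running command on a machine without a GPU.

```shell
#!/usr/bin/env bash
# run_with_monitor is a hypothetical helper (not part of nvidia-smi or bash).
# Usage: run_with_monitor "<monitor command>" <logfile> <job> [job args...]
# The monitor must be a simple command, because it is started through exec.
run_with_monitor() {
    local monitor_cmd=$1 logfile=$2
    shift 2

    # Start the monitor in the background. exec replaces the helper shell,
    # so $! is the PID of the monitor itself.
    sh -c "exec $monitor_cmd" > "$logfile" 2>&1 &
    LAST_MONITOR_PID=$!   # global, kept only so callers can inspect it

    # Run the job in the foreground and remember its exit status.
    "$@"
    local status=$?

    # Stop the monitor and reap it so nothing is left behind.
    kill "$LAST_MONITOR_PID" 2>/dev/null
    wait "$LAST_MONITOR_PID" 2>/dev/null
    return "$status"
}

# Example call, assuming nvidia-smi is installed and ./process1 exists:
# run_with_monitor "nvidia-smi -lms 50" logfile.txt ./process1
```

The function returns the job's exit status, so a calling script can still react to a failed run. If the calling script itself is killed while the job runs, the monitor is not cleaned up; an EXIT trap in the caller would be needed for that case.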

How to use GNU parallel (bash scripting) with aprun command on Cray XE6 compute nodes (Unix like env)?

I am trying to run 16 instances of an mpi4py Python script, hello.py. I stored 16 commands of this sort in s.txt:
python /lustre/4_mpi4py/hello.py > 01.out
I am submitting this on a Cray cluster via the aprun command like this:
aprun -n 32 sh -c 'parallel -j 8 :::: s.txt'
My intention was to run 8 of those Python jobs per node at a time. The script ran for more than 3 hours and none of the *.out files was created. In the PBS scheduler output file I am getting this:
Python version 2.7.3 loaded
aprun: Apid 11432669: Caught signal Terminated, sending to application
aprun: Apid 11432669: Caught signal Terminated, sending to application
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 02.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 06.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 10.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 08.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
I am running this on one node, which has 32 cores.
I suppose my use of the GNU parallel command is wrong. Can someone please help with this?
As listed in https://portal.tacc.utexas.edu/documents/13601/1102030/4_mpi4py.pdf#page=8
from mpi4py import MPI
comm = MPI.COMM_WORLD
print "Hello! I'm rank %02d from %02d" % (comm.rank, comm.size)
print "Hello! I'm rank %02d from %02d" % (comm.Get_rank(), comm.Get_size())
print "Hello! I'm rank %02d from %02d" % (MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())
Your 4_mpi4py/hello.py program is not a typical single-process program (or single Python script) but a multi-process MPI application.
GNU parallel expects simpler programs and does not support interaction with MPI processes.
Your cluster has many nodes, and every node may start a different number of MPI processes (with two 8-core CPUs per node, consider the variants: 2 MPI processes of 8 OpenMP threads each; 1 MPI process of 16 threads; 16 MPI processes without threads). To describe the slice of the cluster assigned to your task, there is an interface between the cluster management software and the MPI library used by the Python MPI wrapper that your script uses. That management is done by aprun (and qsub?):
http://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/aprun-man-page/
https://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/
You must use the aprun command to launch jobs on the Hopper compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.
https://wickie.hlrs.de/platforms/index.php/CRAY_XE6_Using_the_Batch_System
The job launcher for the XE6 parallel jobs (both MPI and OpenMP) is aprun. ... The aprun example above will start the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2". The job will be started using 64 MPI processes with 32 processes placed on each of your allocated nodes (remember that a node consists of 32 cores in the XE6 system). You need to have nodes allocated by the batch system before (qsub).
There is an interface between aprun, qsub, and MPI: in a normal start (aprun -n 32 python /lustre/4_mpi4py/hello.py), aprun just starts several (32) processes of your MPI program, sets the ID of each process in that interface, and gives them the group ID (for example, with environment variables like PMI_ID; the actual variables are specific to the launcher/MPI library combination).
GNU parallel has no interface to MPI programs and knows nothing about such variables. It will just start 8 times more processes than expected. All 32 * 8 processes in your incorrect command will have the same group ID, and there will be 8 processes with the same MPI process ID. This makes your MPI library misbehave.
Never mix MPI resource managers / launchers with pre-MPI Unix process forkers like xargs or parallel or "very advanced bash scripting for parallelism". MPI is there for doing the parallel work, and the MPI launcher / job manager (aprun, mpirun, mpiexec) is there for starting several processes, forking, and ssh-ing to machines.
Don't do aprun -n 32 sh -c 'parallel anything_with_MPI'; this is an unsupported combination. The only allowed argument to aprun is a program with some supported form of parallelism, like OpenMP, MPI, or MPI+OpenMP, or a non-parallel program (or a single script that starts ONE parallel program).
If you have several independent MPI tasks to start, pass several colon-separated segments to aprun: aprun -n 8 ./program_to_process_file1 : -n 8 ./program_to_process_file2 : -n 8 ./program_to_process_file3 : -n 8 ./program_to_process_file4
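When the list of programs is long, the colon-separated command line can be assembled by a small helper instead of typed by hand. This is only a sketch: build_mpmd_command is a hypothetical name, it prints the command rather than running it, and it assumes every segment gets the same number of processes.

```shell
#!/usr/bin/env bash
# build_mpmd_command is a hypothetical helper: it only prints an aprun
# command line and does not run anything.
# Usage: build_mpmd_command <processes per program> <program> [program...]
build_mpmd_command() {
    local ranks=$1
    shift
    local cmd="aprun" sep="" prog
    for prog in "$@"; do
        # Segments after the first are introduced by " :".
        cmd="$cmd$sep -n $ranks $prog"
        sep=" :"
    done
    printf '%s\n' "$cmd"
}

build_mpmd_command 8 ./program_to_process_file1 ./program_to_process_file2
# prints: aprun -n 8 ./program_to_process_file1 : -n 8 ./program_to_process_file2
```

Printing the command first makes it easy to check the segments in the job script before the allocation is spent on a malformed launch.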
If you have multiple files to work on, start many parallel jobs: use not a single qsub but several, and let PBS (or whichever job manager is used) manage your jobs.
If you have a very large number of files, try not to use MPI in your program (don't link MPI libraries or include MPI headers at all) and use parallel or another pre-MPI form of parallelism, hidden from aprun. Or use a single MPI program and implement the file distribution directly in your code (the master MPI process may open the file list, then distribute files among the other MPI processes, with or without the dynamic process management of MPI / mpi4py: http://pythonhosted.org/mpi4py/usrman/tutorial.html#dynamic-process-management).
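For the non-MPI route, a command list such as s.txt can be worked through a few lines at a time. The sketch below swaps in xargs -P for GNU parallel, purely because it is commonly preinstalled; run_pool is a hypothetical helper name, the -d option is specific to GNU xargs, and every line of the file is assumed to be a serial (non-MPI) shell command.

```shell
#!/usr/bin/env bash
# run_pool is a hypothetical helper. It runs each line of a command file
# as its own shell command, keeping at most <jobs> of them running at once.
# Usage: run_pool <jobs> <command file>
run_pool() {
    local jobs=$1 cmdfile=$2
    # -d '\n' : treat every input line as one item (GNU xargs option)
    # -n 1    : pass one item per invocation
    # -P      : upper bound on concurrently running invocations
    # xargs appends the line after "sh -c", so the line is the script to run.
    xargs -d '\n' -n 1 -P "$jobs" sh -c < "$cmdfile"
}

# Example, assuming s.txt holds one serial command per line:
# run_pool 8 s.txt
```

Because each line is handed to sh -c unchanged, redirections such as > 01.out inside the file keep working.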
Some scientists try to combine MPI and parallel in the other order: parallel ... aprun ... or parallel ... mpirun ...:
https://rcc.uchicago.edu/docs/tutorials/kicp-tutorials/running-jobs.html#gnu-parallel
http://www.hpc.lsu.edu/training/weekly-materials/2017-Spring/gnuparallel-Feb2017.pdf#page=41
and there is a version of parallel for your Cray: https://github.com/levinas/cray-parallel

running multiple instances of a single MPI executable in parallel on a cluster using mpirun options?

I'm trying to write a shell script that performs an algorithm, part of which requires parallel execution of an MPI executable across multiple input files on a Grid Engine cluster. From what I read, mpirun supports MPMD execution either through the colon syntax or through an application context/schema file passed as mpirun --app my_appfile. This is what my my_appfile looks like:
-np 12 /path/to/executable /path/to/dir1/input1
-np 12 /path/to/executable /path/to/dir2/input2
-np 12 /path/to/executable /path/to/dir3/input3
...
-np 12 /path/to/executable /path/to/dir10/input10
I was trying to execute 10 instances of the same executable in parallel and assign the cluster resources accordingly (120 processes in this case, in SGE's orte parallel environment).
However, there was a problem. Each input file is written to generate its output in the same directory as that input file. When I submitted the job (the submission script contains only the mpirun --app my_appfile line), only the output from input1 in dir1 appeared, not the rest. So I wonder what the problem is here: is it the mpirun options or how the cluster runs the job? Any help would be highly appreciated. Thank you!
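For reference, an appfile with this layout can be generated rather than written by hand. This is a sketch only: make_appfile is a hypothetical helper name, the /path/to/dirN/inputN pattern is copied from the listing above, and the code reproduces the file as described without addressing the missing-output problem.

```shell
#!/usr/bin/env bash
# make_appfile is a hypothetical helper that prints one
# "-np <processes> <executable> <input>" line per input directory.
# Usage: make_appfile <processes> <executable> <count>
make_appfile() {
    local ranks=$1 exe=$2 count=$3 i
    for ((i = 1; i <= count; i++)); do
        # "--" keeps printf from reading the leading "-np" as an option.
        printf -- '-np %s %s /path/to/dir%d/input%d\n' "$ranks" "$exe" "$i" "$i"
    done
}

# Example: write the 10-line file used in the question.
# make_appfile 12 /path/to/executable 10 > my_appfile
```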

Matlab bad performance under Jenkins

I have a Jenkins script execution step that processes output data with Matlab to evaluate test results.
When I run the script from the command prompt it starts up and exits quite fast, but when the same script is executed with the same arguments from Jenkins it performs extremely poorly. I get the Matlab welcome message in the "prompt only" window that appears, but nothing else within the 2-hour timeout that I have set for the job.
I have disabled the Jenkins Windows service on the node and am running the node process from the desktop, but it makes no difference:
C:\Windows\System32\java.exe -jar c:\j-mpc\slave.jar -jnlpUrl http://<server>/slave-agent.jnlp -secret <xxxxx>
I also tried to increase the memory for the node process, but there was no change:
C:\Windows\System32\java.exe -Xmx2048m
When I kill the process tree starting with bash, Process Explorer shows that it descends from the java.exe-sh.exe tree, but there is a missing PID in between:
java.exe (<0.01%, 1 420 000K)
sh.exe (<0.01%, 2 140K)
bash.exe (<0.01%, 2 580K)
bash.exe ( , 2 580K)
python.exe ( , 6 044K)
python.exe ( , 4 800K)
matlab.exe ( , 1 844K)
MATLAB.exe (<0.01%, 167 324K)
Is there a hidden limitation on child processes that limits memory or process usage when called from Jenkins? In other jobs I don't see the same limitations. Memory allocation for Matlab is very slow (growing from start to a reasonable size of >100M takes about a minute).
(I have a screen dump from Process Explorer but am not allowed to upload it.)
EDIT
I have also tried reducing the call to a single Windows command line from Jenkins (I suspected the deep call stack was to blame), but with the same result.
matlab.exe -nodisplay -nosplash -nodesktop -wait -logfile "log_file.txt" -r "try script_file ;catch err; disp(err.message); end ; exit"
Solved by setting the LM_LICENSE_FILE environment variable in the Jenkins node setup.
(I found a thread about slow startup.)
Apparently the shell environment started by Jenkins does not completely match the one started from Explorer.

Pausing non-essential CPU processes

Is it possible to command the CPU to pause all non-essential processes until my program has finished processing? The goal is to reduce the number of processes competing for CPU time, and I ultimately expect an improvement in the wall-clock running time of my program.
So I want to start my program running, command the CPU to pause non-essential processes except for my program, and when my program terminates then the CPU can resume the previously paused processes.
On Linux, the obvious initial tactic is to increase the priority of your process using renice. The lower the nice value, the higher the priority, with a maximum priority of -20.
(Here I create a long-running process as an example.)
sleep 100000 &
As root, grep for the process:
ps -ef | grep sleep
500 4323 2995 0 18:44 pts/1 00:00:00 sleep 100000
500 4371 2995 0 18:45 pts/1 00:00:00 grep --color=auto sleep
Renice the process to a very high priority:
renice -20 4323
You can also send the SIGSTOP and SIGCONT signals to stop and continue particular processes, like so:
skill -STOP -p <processid>
skill -CONT -p <processid>
Unfortunately, what constitutes a non-essential process depends on your own definition. You can stop all non-root processes by examining the process list and using the following commands to temporarily stop, and later resume, all of a particular user's processes:
skill -STOP -u <userid>
skill -CONT -u <userid>
Obviously, beware of stopping processes such as the shell that spawned your sudo root session.
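The stop, run, resume sequence asked for in the question can be sketched as one function. This is an illustration under assumptions, not a definitive implementation: run_with_paused is a hypothetical helper name, the PIDs to pause must be chosen by the caller, and kill -STOP / kill -CONT are used here in place of skill because kill is available in every POSIX shell.

```shell
#!/usr/bin/env bash
# run_with_paused is a hypothetical helper: it stops the given PIDs,
# runs a command, then resumes them whatever the command's exit status.
# Usage: run_with_paused "<pid> [pid...]" <command> [args...]
run_with_paused() {
    local pids=$1
    shift

    # $pids is left unquoted on purpose so the list splits into words.
    # If any PID cannot be stopped, resume the others and give up.
    kill -STOP $pids || { kill -CONT $pids 2>/dev/null; return 1; }

    # Run the real workload while the other processes are stopped.
    "$@"
    local status=$?

    # Let the paused processes continue.
    kill -CONT $pids
    return "$status"
}

# Example, with 1234 and 5678 as placeholder PIDs:
# run_with_paused "1234 5678" ./myProgram
```

If the script itself is killed before it reaches the resume step, the paused processes stay stopped until someone sends them SIGCONT by hand, so a trap on EXIT in the calling script is worth considering.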

Resources