Julia using Slurm invokes only one node - parallel-processing

Running
srun --nodes=3 hostname
successfully returns all three node names, but
srun --nodes=3 julia test.jl
fails with the error below (test.jl is given at the end):
Worker 2 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
[1] process_hdr(::TCPSocket, ::Bool) at ./distributed/process_messages.jl:257
[2] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:143
[3] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118
[4] (::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at ./event.jl:73
srun: error: n-0-2: task 2: Exited with exit code 1
srun: error: n-0-0: task 0: Exited with exit code 1
srun: error: n-0-2: task 2: Exited with exit code 1
srun: error: n-0-1: task 1: Exited with exit code 1
test.jl:
using ClusterManagers
addprocs(SlurmManager(3), partition="slurm", t="00:5:00")
hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    push!(hosts, host)
    push!(pids, pid)
end
# The Slurm resource allocation is released when all the workers have
# exited
for i in workers()
    rmprocs(i)
end
Note: Julia is not installed as an environment module, but it is in a shared directory accessible from all nodes.


Spawning child processes on HPC using Slurm

I ran into a problem when trying to spawn child processes on an HPC cluster from a Slurm script. The parent process is a Python script and the child process is a C++ program. The source code is below.
parent process:
# parent process : mpitest.py
from mpi4py import MPI
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=x)
child process:
// child process: test_mpi.cpp, in which parallelization is implemented using boost::mpi
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi.hpp>
#include <iostream>
using namespace boost;
int main(int argc, char* argv[]){
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;
    int commrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &commrank);
    std::cout << commrank << std::endl;
    return 0;
}
The Slurm script:
#SBATCH --job-name=mpitest
#SBATCH --partition=day
#SBATCH -n 4
#SBATCH -c 6
#SBATCH --mem 5G
#SBATCH -t 01-00:00:00
#SBATCH --output="mpitest.out"
#SBATCH --error="mpitest.error"
#run program
module load Boost/1.74.0-gompi-2020b
#the MPI version is: OpenMPI/4.0.5
mpirun -np y python mpitest.py
The code above has two parameters: x (maxprocs in MPI.COMM_SELF.Spawn) and y (-np in mpirun). The Slurm script only ran correctly with x = 1, y = 1. When I increased either value, errors occurred. With x = 1, y = 2:
[c05n04:13182] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c05n04:13182] [[31687,1],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c05n11:07942] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c05n11:07942] [[31687,2],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
[c05n11:07942] *** An error occurred in MPI_Init
[c05n11:07942] *** reported by process [2076639234,0]
[c05n11:07942] *** on a NULL communicator
[c05n11:07942] *** Unknown error
[c05n11:07942] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c05n11:07942] *** and potentially your MPI job)
Similarly, when x = 2, y = 2, the following error occurred:
All nodes which are allocated for this job are already filled.
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=2)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes
[c18n08:16481] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n08:16481] [[54742,1],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c18n11:01329] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n11:01329] [[54742,2],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c18n11:01332] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n11:01332] [[54742,2],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=2)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
[c18n11:01332] *** An error occurred in MPI_Init
[c18n11:01332] *** reported by process [3587571714,1]
[c18n11:01332] *** on a NULL communicator
[c18n11:01332] *** Unknown error
[c18n11:01332] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c18n11:01332] *** and potentially your MPI job)
[c18n08:16469] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[c18n08:16469] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[c18n08:16469] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
What I want is to use several nodes to run the C++ program in parallel from this Python script, i.e. x = 1, y > 1. What should I modify in the Python code or the Slurm script?
Update: following the advice of Gilles Gouaillarde, I modified the Slurm script:
#SBATCH --job-name=mpitest
#SBATCH --partition=day
#SBATCH -n 4
#SBATCH -c 6
#SBATCH --mem 5G
#SBATCH -t 01-00:00:00
#SBATCH --output="mpitest.out"
#SBATCH --error="mpitest.error"
#run program
module load Boost/1.74.0-gompi-2020b
#the MPI version is: OpenMPI/4.0.5
mpirun --mca pml ^ucx --mca btl ^ucx --mca osc ^ucx -np 2 python mpitest.py
It still fails with the following error (here mpirun -np 2 and maxprocs=1 were used):
[c18n08:21555] [[49573,1],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
[c18n10:03381] [[49573,3],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
[c18n08:21556] [[49573,1],1] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
[c18n10:03380] [[49573,2],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
ompi_dpm_dyn_init() failed
--> Returned "Not found" (-13) instead of "Success" (0)
[c18n10:03380] *** An error occurred in MPI_Init
[c18n10:03380] *** reported by process [3248816130,0]
[c18n10:03380] *** on a NULL communicator
[c18n10:03380] *** Unknown error
[c18n10:03380] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c18n10:03380] *** and potentially your MPI job)
[c18n08:21542] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[c18n08:21542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[c18n08:21542] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

GNU time returns different signal than it prints out

While running cronjobs and using mk-job from check_mk to monitor its result, I've stumbled across this:
$ /usr/bin/time -f "time_exit: %x" timeout -s SIGKILL 2s sleep 10; echo "shell_exit: $?"
Command terminated by signal 9
time_exit: 0
shell_exit: 137
The exit code returned by /usr/bin/time differs from the exit code it writes to its formatted output:
time_exit != shell_exit
But when SIGHUP is sent instead, the exit codes match:
$ /usr/bin/time -f "time_exit: %x" timeout -s SIGHUP 2s sleep 10; echo "shell_exit: $?"
Command exited with non-zero status 124
time_exit: 124
shell_exit: 124
In the meantime I will use timeout -k 10s 2s ..., which first sends the default SIGTERM and then, if the process is still running 10s later, a SIGKILL, in the hope that SIGTERM stops it cleanly.
check_mk provides mk-job to monitor job executions. mk-job uses time to record execution times AND exit code.
man time:
The time command returns when the program exits, stops, or is terminated by a signal. If the program exited normally, the return value of time is the return value of the program it executed and measured. Otherwise, the return value is 128 plus the number of the signal which caused the program to stop or terminate.
man timeout:
... It may be necessary to use the KILL (9) signal, since this signal cannot be caught, in which case the exit status is 128+9 rather than 124.
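The "128 plus the signal number" rule quoted above can be illustrated with a small Python sketch. It is not part of time or timeout; the helper name shell_exit_code is hypothetical, and the two child commands are stand-ins for a program that exits with 124 and one that is killed by SIGKILL. It relies on the fact that Python's subprocess module reports "killed by signal N" as a return code of -N on POSIX systems.

```python
import subprocess
import sys


def shell_exit_code(returncode):
    """Map a subprocess return code to the value a POSIX shell would put in $?."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N
        return 128 + (-returncode)
    return returncode


# Hypothetical stand-ins for the commands in the question: one child exits
# normally with status 124, the other kills itself with SIGKILL (signal 9).
exited = subprocess.run([sys.executable, "-c", "import sys; sys.exit(124)"])
killed = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

print(shell_exit_code(exited.returncode))
print(shell_exit_code(killed.returncode))
```

On a POSIX system the first value is the child's own exit status and the second is 128 + 9.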
GNU time's %x is only meaningful when the process exits normally, not when it is killed by a signal.
[STEP 101] $ /usr/bin/time -f "%%x = %x" bash -c 'exit 2'; echo '$? = '$?
Command exited with non-zero status 2
%x = 2
$? = 2
[STEP 102] $ /usr/bin/time -f "%%x = %x" bash -c 'exit 137'; echo '$? = '$?
Command exited with non-zero status 137
%x = 137
$? = 137
[STEP 103] $ /usr/bin/time -f "%%x = %x" bash -c 'kill -KILL $$'; echo '$? = '$?
Command terminated by signal 9
%x = 0
$? = 137
[STEP 104] $
For time timeout -s SIGKILL 2s sleep 10, timeout exits normally with 137, it's not killed by SIGKILL, just like bash -c 'exit 137' in my example.
I took a look at time's source code and found that %x blindly calls WEXITSTATUS() regardless of whether the process exited normally:
case 'x': /* Exit status. */
    fprintf (fp, "%d", WEXITSTATUS (resp->waitstatus));
    break;
The Git master adds a new %Tx:
case 'T':
    switch (*++fmt)
    {
    /* ... */
    case 'x': /* exit code IF terminated normally */
        if (WIFEXITED (resp->waitstatus))
            fprintf (fp, "%d", WEXITSTATUS (resp->waitstatus));
        break;
And from the Git master's time --help output:
%Tt exit type (normal/signalled)
%Tx numeric exit code IF exited normally
%Tn numeric signal code IF signalled
%Ts signal name IF signalled
%To 'ok' IF exited normally with code zero
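The difference between checking WIFEXITED first and calling WEXITSTATUS unconditionally can be reproduced outside of time. The following is a minimal Python sketch, assuming a POSIX system; run_and_wait and describe are hypothetical helper names, and the two children are stand-ins for bash -c 'exit 2' and bash -c 'kill -KILL $$' from the examples above.

```python
import os
import sys


def run_and_wait(code):
    """Start a Python child running `code` and return its raw wait status."""
    pid = os.spawnv(os.P_NOWAIT, sys.executable, [sys.executable, "-c", code])
    _, status = os.waitpid(pid, 0)
    return status


def describe(status):
    """Decode a wait status, checking how the child ended before reading a code."""
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        return ("signalled", os.WTERMSIG(status))
    return ("other", status)


normal = run_and_wait("import sys; sys.exit(2)")
killed = run_and_wait("import os, signal; os.kill(os.getpid(), signal.SIGKILL)")

print(describe(normal))
print(describe(killed))
# Reading the exit-status bits without checking WIFEXITED first, as %x does:
print(os.WEXITSTATUS(killed))
```

On Linux the last line prints 0 for the killed child, which matches the "%x = 0" seen in STEP 103.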

Custom job script submission to PBS via Dask?

I have a PBS job script with an executable that writes its results to an output file.
### some lines
### Copy application directory on compute node
[ -d $PBS_O_EXEDIR ] || mkdir -p $PBS_O_EXEDIR
[ -w $PBS_O_EXEDIR ] && \
rsync -Cavz --rsh=$SSH $HOST:$PBS_O_EXEDIR `dirname $PBS_O_EXEDIR`
[ -d $PBS_O_WORKDIR ] || mkdir -p $PBS_O_WORKDIR
rsync -Cavz --rsh=$SSH $HOST:$PBS_O_WORKDIR `dirname $PBS_O_WORKDIR`
# Change into the working directory
# Save the jobid in the outfile
# Run the executable
In my project I have to use Dask to submit and monitor these jobs, so I configured the jobqueue.yaml file like this:
name: htc_calc
# Dask worker options
cores: 4 # Total number of cores per job
memory: 50GB # Total amount of memory per job
# PBS resource manager options
shebang: "#!/usr/bin/env bash"
walltime: '00:30:00'
exe_dir: "/home/r/rb11/softwares/FPLO/bin"
excutable: "fplo18.00-57-x86_64"
outfile: "out"
job-extra: "exe_dir/executable >> outfile"
However, I got this error while submitting jobs via Dask.
qsub: directive error: e
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f3d8c4a56a8>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py:284> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 1\nCommand:\nqsub /tmp/tmpwyvkfcmi.sh\nstdout:\n\nstderr:\nqsub: directive error: e \n\n',)>)
Traceback (most recent call last):
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/ioloop.py", line 758, in _run_callback
ret = callback()
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/asyncio/futures.py", line 294, in result
raise self._exception
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/asyncio/tasks.py", line 240, in _step
result = coro.send(None)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py", line 317, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py", line 41, in _
await self.start()
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 285, in start
out = await self._submit_job(fn)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 268, in _submit_job
return self._call(shlex.split(self.submit_command) + [script_filename])
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 368, in _call
"stderr:\n{}\n".format(proc.returncode, cmd_str, out, err)
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
qsub /tmp/tmpwyvkfcmi.sh
qsub: directive error: e
How do I specify a custom bash script in Dask?
Dask is used for distributing Python applications. In the case of Dask Jobqueue it works by submitting a scheduler and workers to the batch system, which connect together to form their own cluster. You can then submit Python work to the Dask scheduler.
From your example, it looks like you are trying to use the cluster setup configuration to run your own bash application instead of Dask.
In order to do this with Dask you should return the jobqueue config to the defaults and instead write a Python function which calls your bash script.
import os
from dask_jobqueue import PBSCluster
cluster = PBSCluster()
cluster.scale(jobs=10) # Deploy ten single-node jobs
from dask.distributed import Client
client = Client(cluster) # Connect this local process to remote workers
client.submit(os.system, "/path/to/your/script") # Run the script on one of the workers
However, it seems that Dask may not be a good fit for what you are trying to do. You would probably be better off submitting your job to PBS directly.
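If Dask is still required, the "Python function which calls your bash script" could look roughly like the sketch below. The function name run_executable is hypothetical, and the demo command is a local stand-in for the real executable so that the sketch runs without PBS or Dask. It appends the program's output to a file, which is what the "executable >> outfile" line in the question was trying to express.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_executable(argv, outfile):
    """Run argv, append its stdout and stderr to outfile, return the exit code."""
    with open(outfile, "a") as out:
        completed = subprocess.run(argv, stdout=out, stderr=subprocess.STDOUT)
    return completed.returncode


# Local stand-in for the real executable (an assumption for illustration only).
demo_dir = Path(tempfile.mkdtemp())
demo_out = demo_dir / "out"
demo_cmd = [sys.executable, "-c", "print('result line')"]
exit_code = run_executable(demo_cmd, demo_out)
print(exit_code, demo_out.read_text().strip())
```

With a running cluster, the same function could be handed to client.submit together with the real command and output path, provided both paths are valid on the worker nodes.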

Killing shell processes when thread dies

I have code like this:
#! /usr/bin/env ruby
Thread.new do
  `shellcommand 1 2 3`
end
do_other_stuff
If do_other_stuff encounters an exception, it kills the thread and the whole ruby process, which is what I want. But, shellcommand 1 2 3 continues running in the background.
How can I have shellcommand 1 2 3 also be killed when the ruby process aborts?
You can't (with a general flag on Thread at least). It's a separate process that you started in a thread. The thread dying doesn't stop the process.
You have to save the pid of the process and terminate it explicitly:
Thread.report_on_exception = false
Thread.abort_on_exception = true
pids = []
# Kill every recorded child process when the Ruby process exits
at_exit do
  pids.each do |pid|
    `kill -9 #{pid}`
  end
end
Thread.new do
  pids.push Process.spawn('for((i=0; ;++i)); do echo "$i"; sleep 1; done')
end
sleep 5
raise 'Failure'
Traceback (most recent call last):
test.rb:17:in `<main>': Failure (RuntimeError)
Needless to say, don't use this code as is in production.
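For comparison, the same "record the child and kill it in an exit hook" pattern can be sketched in Python. This is an illustration of the idea rather than a translation of the Ruby code above; the names children and kill_children are hypothetical, and the sleeping child is a stand-in for the long-running shell command.

```python
import atexit
import subprocess
import sys

children = []


def kill_children():
    """Kill every recorded child that is still running, then reap it."""
    for proc in children:
        if proc.poll() is None:
            proc.kill()
            proc.wait()


# Runs on normal interpreter exit and after an unhandled exception,
# but not if this process is itself killed by a signal.
atexit.register(kill_children)

# Stand-in for the long-running shell command.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
children.append(child)
```

As with the Ruby version, the hook does not run when the parent itself is terminated by SIGKILL, so it is not a complete solution.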

Exit shell scripts when previously finished child has failed

I have a shell script that runs several sub-scripts in the background using & and runs others in the foreground by calling them with ./
This all happens in a loop in the parent script.
At the end of the loop in the parent script I want to wait until all child processes have finished and get their exit statuses. If one of the child scripts has failed, I want to exit the loop and the parent script.
What I currently do is collect the PID of each sub-script with $! after it is started. Then I loop through the PIDs and call wait $pid || exit 1.
But for some reason I get the error: pid 12345 is not a child of this shell. What could be the reason for this error?
Is there another way to do this?
extract () {
    ./p1.sh &
    pids="$pids $!"
    ./p2.sh &
    pids="$pids $!"
}
load () {
    pids="$pids $!"
}
echo ${pids[@]}
for p in $pids; do
    wait $p
done
The output is:
pid: 13725 end process 1
pid: 13726 end process 2
pid: 13727 end p3
13725 13726 13726
As you can see, the PID 13726 is somehow added twice. The error message I get is: wait: pid 13726 is not a child of this shell
