Spawning child processes on HPC using Slurm - boost

I encountered a problem when I wanted to spawn child processes on HPC using a Slurm script. The parent process is a Python script, and the child process is a C++ program. The source code is below:
parent process:
# parent process : mpitest.py
from mpi4py import MPI
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=x)
child process:
// child process: test_mpi.cpp, in which parallelization is implemented using boost::mpi
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi.hpp>
#include <iostream>

using namespace boost;

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;
    int commrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &commrank);
    std::cout << commrank << std::endl;
    return 0;
}
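For reference, Spawn returns an intercommunicator whose remote group holds the spawned children, and test_mpi.cpp above never retrieves its side of it. Below is a minimal sketch of a fuller parent (not the mpitest.py used in the runs above), under the assumption that the child also calls MPI_Comm_get_parent and MPI_Comm_disconnect so that the collective disconnect can complete:
# sketch of a fuller parent; not the mpitest.py used above
from mpi4py import MPI

# Spawn the children; the returned object is an intercommunicator whose
# remote group contains the spawned processes.
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=2)
print("spawned", sub_comm.Get_remote_size(), "children")

# Disconnect is collective over the intercommunicator: it only completes
# if the children call MPI_Comm_get_parent() and MPI_Comm_disconnect() too.
sub_comm.Disconnect()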
The Slurm script:
#!/bin/bash
#SBATCH --job-name=mpitest
#SBATCH --partition=day
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6
#SBATCH --mem 5G
#SBATCH -t 01-00:00:00
#SBATCH --output="mpitest.out"
#SBATCH --error="mpitest.error"
#run program
module load Boost/1.74.0-gompi-2020b
#the MPI version is: OpenMPI/4.0.5
mpirun -np y python mpitest.py
In the above code, there are two parameters: x (maxprocs in MPI.COMM_SELF.Spawn) and y (-np in mpirun). The Slurm script only ran normally with x = 1, y = 1. When I tried to increase x and y, the following error occurred (x = 1, y = 2):
[c05n04:13182] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c05n04:13182] [[31687,1],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c05n11:07942] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c05n11:07942] [[31687,2],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[c05n11:07942] *** An error occurred in MPI_Init
[c05n11:07942] *** reported by process [2076639234,0]
[c05n11:07942] *** on a NULL communicator
[c05n11:07942] *** Unknown error
[c05n11:07942] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c05n11:07942] *** and potentially your MPI job)
Similarly, when x = 2, y = 2, the following error occurred:
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=2)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes
[c18n08:16481] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n08:16481] [[54742,1],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c18n11:01329] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n11:01329] [[54742,2],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c18n11:01332] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n11:01332] [[54742,2],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=2)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[c18n11:01332] *** An error occurred in MPI_Init
[c18n11:01332] *** reported by process [3587571714,1]
[c18n11:01332] *** on a NULL communicator
[c18n11:01332] *** Unknown error
[c18n11:01332] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c18n11:01332] *** and potentially your MPI job)
[c18n08:16469] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[c18n08:16469] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[c18n08:16469] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
What I want to do is use several nodes to run the C++ program in parallel via this Python script, i.e. x = 1, y > 1. What should I modify in the Python code or the Slurm script?
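As a side note on the "All nodes which are allocated for this job are already filled" message: under Open MPI, processes created by MPI_Comm_spawn need free slots on top of those used by mpirun -np y, so (assuming each of the y parents spawns x children) the allocation has to cover y + x*y processes. A quick sanity check of the cases above against the #SBATCH -n 4 allocation:
# slot bookkeeping sketch (assumption: each of the y parents spawns x children)
def slots_needed(x, y):
    return y + x * y  # parents launched by mpirun plus spawned children

allocated = 4  # from "#SBATCH -n 4"
for x, y in [(1, 1), (1, 2), (2, 2)]:
    need = slots_needed(x, y)
    status = "fits" if need <= allocated else "exceeds the allocation"
    print("x=%d, y=%d: needs %d of %d slots (%s)" % (x, y, need, allocated, status))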
Update: following the advice from Gilles Gouaillarde, I modified the Slurm script:
#!/bin/bash
#SBATCH --job-name=mpitest
#SBATCH --partition=day
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6
#SBATCH --mem 5G
#SBATCH -t 01-00:00:00
#SBATCH --output="mpitest.out"
#SBATCH --error="mpitest.error"
#run program
module load Boost/1.74.0-gompi-2020b
#the MPI version is: OpenMPI/4.0.5
mpirun --mca pml ^ucx --mca btl ^ucx --mca osc ^ucx -np 2 python mpitest.py
It still shows the following error (here mpirun -np 2 and maxprocs=1 were used):
[c18n08:21555] [[49573,1],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
[c18n10:03381] [[49573,3],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
[c18n08:21556] [[49573,1],1] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
[c18n10:03380] [[49573,2],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[c18n10:03380] *** An error occurred in MPI_Init
[c18n10:03380] *** reported by process [3248816130,0]
[c18n10:03380] *** on a NULL communicator
[c18n10:03380] *** Unknown error
[c18n10:03380] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c18n10:03380] *** and potentially your MPI job)
[c18n08:21542] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[c18n08:21542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[c18n08:21542] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

Related

Oracle EBS 12.9 adnodemgrctl stack

01/23/23-10:07:16 :: adnodemgrctl.sh version 120.11.12020000.12
Calling txkChkEBSDependecies.pl to perform dependency checks for ALL MANAGED SERVERS
ERROR DESCRIPTION:
(FATAL ERROR
PROGRAM : /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl(/oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/bin/txkrun.pl)
TIME : Mon Jan 23 10:07:17 2023
FUNCTION: TXK::Process::run [ Level 3 ]
MESSAGES:
Command error: = 2304, = unzip -o /oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/txkChkEBSDependecies_Mon_Jan_23_10_07_17_2023/formsapp.ear -d /oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/txkChkEBSDependecies_Mon_Jan_23_10_07_17_2023
STACK TRACE
at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Error.pm line 168
TXK::Error::abort('TXK::Error', 'HASH(0xd432e8)') called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Common.pm line 299
TXK::Common::doError('TXK::Process=HASH(0x19224d0)', 'Command error: = 2304, = unzip -o /oracle/u02/...', undef) called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Common.pm line 314
TXK::Common::setError('TXK::Process=HASH(0x19224d0)', 'Command error: = 2304, = unzip -o /oracle/u02/...') called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Process.pm line 449
TXK::Process::run('TXK::Process=HASH(0x19224d0)', 'HASH(0x1a22af0)') called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 1163
TXK::RunScript::unzipFile('/oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/tx...', '/oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/tx...') called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 815
TXK::RunScript::extractLatestFrmSrvJar() called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 1248
TXK::RunScript::checkFormsApp() called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 276
eval {...} called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 182
require /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/RunScript.pm line 105
TXK::RunScript::require('TXK::RunScript', '/oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txk...') called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Script.pm line 177
eval {...} called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Script.pm line 177
TXK::Script::run('TXK::Script=HASH(0x139e460)', '/oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK', '/oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txk...') called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/bin/txkrun.pl line 174
)
W A R N I N G *
Error while executing the perl script txkChkEBSDependecies.pl
We have determined that one or more checks failed for MANAGED SERVERS.
Please check logs for more details.
Validated the passed arguments for the option ebs-get-nmstatus
Connecting to Node Manager ...
Successfully Connected to Node Manager.
The Node Manager is up.
NodeManager log is located at /oracle/u02/VIS/fs2/FMW_Home/wlserver_10.3/common/nodemanager/nmHome1
01/23/23-10:07:21 :: adnodemgrctl.sh: exiting with status 2
Can anyone help me with this problem?

Custom job script submission to PBS via Dask?

I have a PBS job script with an executable that writes results to an out file.
### some lines
PBS_O_EXEDIR="path/to/software"
EXECUTABLE="executablefile"
OUTFILE="out"
### Copy application directory on compute node
[ -d $PBS_O_EXEDIR ] || mkdir -p $PBS_O_EXEDIR
[ -w $PBS_O_EXEDIR ] && \
rsync -Cavz --rsh=$SSH $HOST:$PBS_O_EXEDIR `dirname $PBS_O_EXEDIR`
[ -d $PBS_O_WORKDIR ] || mkdir -p $PBS_O_WORKDIR
rsync -Cavz --rsh=$SSH $HOST:$PBS_O_WORKDIR `dirname $PBS_O_WORKDIR`
# Change into the working directory
cd $PBS_O_WORKDIR
# Save the jobid in the outfile
echo "PBS-JOB-ID was $PBS_JOBID" > $OUTFILE
# Run the executable
$PBS_O_EXEDIR/$EXECUTABLE >> $OUTFILE
In my project, I have to use Dask to submit and monitor these jobs. Therefore, I have configured the jobqueue.yaml file like this:
jobqueue:
  pbs:
    name: htc_calc

    # Dask worker options
    cores: 4       # Total number of cores per job
    memory: 50GB   # Total amount of memory per job

    # PBS resource manager options
    shebang: "#!/usr/bin/env bash"
    walltime: '00:30:00'
    exe_dir: "/home/r/rb11/softwares/FPLO/bin"
    excutable: "fplo18.00-57-x86_64"
    outfile: "out"
    job-extra: "exe_dir/executable >> outfile"
However, I got this error while submitting jobs via Dask.
qsub: directive error: e
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f3d8c4a56a8>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py:284> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 1\nCommand:\nqsub /tmp/tmpwyvkfcmi.sh\nstdout:\n\nstderr:\nqsub: directive error: e \n\n',)>)
Traceback (most recent call last):
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/ioloop.py", line 758, in _run_callback
ret = callback()
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
future.result()
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/asyncio/futures.py", line 294, in result
raise self._exception
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/asyncio/tasks.py", line 240, in _step
result = coro.send(None)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py", line 317, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py", line 41, in _
await self.start()
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 285, in start
out = await self._submit_job(fn)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 268, in _submit_job
return self._call(shlex.split(self.submit_command) + [script_filename])
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 368, in _call
"stderr:\n{}\n".format(proc.returncode, cmd_str, out, err)
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
qsub /tmp/tmpwyvkfcmi.sh
stdout:
stderr:
qsub: directive error: e
How do I specify a custom bash script in Dask?
Dask is used for distributing Python applications. In the case of Dask Jobqueue it works by submitting a scheduler and workers to the batch system, which connect together to form their own cluster. You can then submit Python work to the Dask scheduler.
From your example, it looks like you are trying to use the cluster setup configuration to run your own bash application instead of Dask.
In order to do this with Dask you should return the jobqueue config to the defaults and instead write a Python function which calls your bash script.
import os

from dask_jobqueue import PBSCluster
cluster = PBSCluster()
cluster.scale(jobs=10)  # Deploy ten single-node jobs

from dask.distributed import Client
client = Client(cluster)  # Connect this local process to remote workers

client.submit(os.system, "/path/to/your/script")  # Run the script on a worker
However it just seems like Dask may not be a good fit for what you are trying to do. You would probably be better off just submitting your job to PBS normally.
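For completeness, here is a sketch of the Python-function approach described above, using the executable path from the jobqueue.yaml in the question; the working directory and the cluster options are placeholders rather than a tested configuration:
import os
import subprocess

from dask.distributed import Client
from dask_jobqueue import PBSCluster

def run_fplo(workdir):
    # Run the executable from the question and redirect its output to "out",
    # mirroring what the original PBS script does.
    exe = "/home/r/rb11/softwares/FPLO/bin/fplo18.00-57-x86_64"
    with open(os.path.join(workdir, "out"), "w") as outfile:
        return subprocess.call([exe], cwd=workdir, stdout=outfile)

cluster = PBSCluster(cores=4, memory="50GB", walltime="00:30:00")
cluster.scale(jobs=2)  # request two PBS jobs' worth of Dask workers
client = Client(cluster)

future = client.submit(run_fplo, "/path/to/workdir")  # placeholder working directory
print(future.result())  # exit code of the executable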

Julia using Slurm invokes only one node

srun --nodes=3 hostname
successfully returns all 3 node names, but
srun --nodes=3 julia test.jl
fails with the error below, where test.jl is given at the end.
Worker 2 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
Stacktrace:
[1] process_hdr(::TCPSocket, ::Bool) at ./distributed/process_messages.jl:257
[2] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:143
[3] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118
[4] (::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at ./event.jl:73
srun: error: n-0-2: task 2: Exited with exit code 1
srun: error: n-0-0: task 0: Exited with exit code 1
srun: error: n-0-2: task 2: Exited with exit code 1
srun: error: n-0-1: task 1: Exited with exit code 1
test.jl
using ClusterManagers
addprocs(SlurmManager(3), partition="slurm", t="00:5:00")
hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    println(host)
    push!(hosts, host)
    push!(pids, pid)
end
# The Slurm resource allocation is released when all the workers have exited
for i in workers()
    rmprocs(i)
end
Why?
Note: I do not have Julia as a module, but Julia is in a shared directory accessible by all nodes.

QIIME2 dada2 rlang.so error

I was running the QIIME2 moving pictures tutorial. At the dada2 step, I ran:
qiime dada2 denoise-single \
--i-demultiplexed-seqs demux.qza \
--p-trim-left 0 \
--p-trunc-len 120 \
--o-representative-sequences rep-seqs-dada2.qza \
--o-table table-dada2.qza
and ran into this error:
Plugin error from dada2:
An error was encountered while running DADA2 in R (return code 1),
please inspect stdout and stderr to learn more.
Debug info has been saved to /tmp/qiime2-q2cli-err-52fzrvlu.log.
I then opened the file /tmp/qiime2-q2cli-err-52fzrvlu.log; this is what I found:
Running external command line application(s). This may print messages
to stdout and/or stderr. The command(s) being run are below. These
commands cannot be manually re-run as they will depend on temporary
files that no longer exist.
Command: run_dada_single.R
/tmp/qiime2-archive-pco6y5vm/fe614b44-775f-41b1-9ee3-04319005e830/data
/tmp/tmpda8dnyve/output.tsv.biom /tmp/tmpda8dnyve 120 0 2.0 2
consensus 1.0 1 1000000
R version 3.3.1 (2016-06-21) Loading required package: Rcpp Error in
dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared
object '/home/cao/lib/R/library/rlang/libs/rlang.so':
/home/cao/lib/R/library/rlang/libs/rlang.so: undefined symbol:
R_ExternalPtrAddrFn In addition: Warning message: package ‘Rcpp’ was
built under R version 3.4.1 Error: package or namespace load failed
for ‘dada2’ Execution halted Traceback (most recent call last): File
"/home/cao/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py",
line 126, in denoise_single
run_commands([cmd]) File "/home/cao/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py",
line 35, in run_commands
subprocess.run(cmd, check=True) File "/home/cao/miniconda3/envs/qiime2-2017.7/lib/python3.5/subprocess.py",
line 398, in run
output=stdout, stderr=stderr) subprocess.CalledProcessError: Command '['run_dada_single.R',
'/tmp/qiime2-archive-pco6y5vm/fe614b44-775f-41b1-9ee3-04319005e830/data',
'/tmp/tmpda8dnyve/output.tsv.biom', '/tmp/tmpda8dnyve', '120', '0',
'2.0', '2', 'consensus', '1.0', '1', '1000000']' returned non-zero
exit status 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File
"/home/cao/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2cli/commands.py",
line 222, in __call__
results = action(**arguments) File "<decorator-gen-252>", line 2, in denoise_single File
"/home/cao/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/qiime2/sdk/action.py",
line 201, in callable_wrapper
output_types, provenance) File "/home/cao/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/qiime2/sdk/action.py",
line 334, in _callable_executor_
output_views = callable(**view_args) File "/home/cao/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py",
line 137, in denoise_single
" and stderr to learn more." % e.returncode) Exception: An error was encountered while running DADA2 in R (return code 1), please
inspect stdout and stderr to learn more.
I then ran 'sudo R' and installed the Rcpp and rlang packages, but I still got the same error when I ran the same command as before:
qiime dada2 denoise-single \
--i-demultiplexed-seqs demux.qza \
--p-trim-left 0 \
--p-trunc-len 120 \
--o-representative-sequences rep-seqs-dada2.qza \
--o-table table-dada2.qza
I figured it out: it was the R version. I uninstalled R 3.4, installed R 3.3, and everything works.

Cloud Init Fails to Install Packages on CentOS 7 in EC2

I have a cloud config file defined on my EC2 instance in the user data. Many of the parts/actions are run by the server properly, but package installation seems to fail invariably:
#cloud-config
package-update: true
package-upgrade: true
# ...
packages:
- puppet3
- lvm2
- btrfs-progs
# ...
I see the following in the logs:
May 20 20:23:39 cloud-init[1252]: util.py[DEBUG]: Package update failed
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/cloudinit/config/cc_package_update_upgrade_install.py", line 74, in handle
cloud.distro.update_package_sources()
File "/usr/lib/python2.6/site-packages/cloudinit/distros/rhel.py", line 278, in update_package_sources
["makecache"], freq=PER_INSTANCE)
File "/usr/lib/python2.6/site-packages/cloudinit/helpers.py", line 197, in run
results = functor(*args)
File "/usr/lib/python2.6/site-packages/cloudinit/distros/rhel.py", line 274, in package_command
util.subp(cmd, capture=False, pipe_cat=True, close_stdin=True)
File "/usr/lib/python2.6/site-packages/cloudinit/util.py", line 1529, in subp
cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['yum', '-t', '-y', 'makecache']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
May 20 20:23:39 cloud-init[1252]: amazon.py[DEBUG]: Upgrade level: security
May 20 20:23:39 cloud-init[1252]: util.py[DEBUG]: Running command ['yum', '-t', '-y', '--exclude=kernel', '--exclude=nvidia*', '--exclude=cudatoolkit', '--security', '--sec-severity=critical', '--sec-severity=important', 'upgrade'] with allowed return codes [0] (shell=False, capture=False)
May 20 20:23:52 cloud-init[1252]: util.py[WARNING]: Package upgrade failed
May 20 20:23:52 cloud-init[1252]: util.py[DEBUG]: Package upgrade failed
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/cloudinit/config/cc_package_update_upgrade_install.py", line 81, in handle
cloud.distro.upgrade_packages(upgrade_level, upgrade_exclude)
File "/usr/lib/python2.6/site-packages/cloudinit/distros/amazon.py", line 50, in upgrade_packages
return self.package_command('upgrade', args=args)
File "/usr/lib/python2.6/site-packages/cloudinit/distros/rhel.py", line 274, in package_command
util.subp(cmd, capture=False, pipe_cat=True, close_stdin=True)
File "/usr/lib/python2.6/site-packages/cloudinit/util.py", line 1529, in subp
cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['yum', '-t', '-y', '--exclude=kernel', '--exclude=nvidia*', '--exclude=cudatoolkit', '--security', '--sec-severity=critical', '--sec-severity=important', 'upgrade']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
May 20 20:23:52 cloud-init[1252]: util.py[DEBUG]: Running command ['yum', '-t', '-y', 'install', 'puppet3', 'lvm2', 'btrfs-progs'] with allowed return codes [0] (shell=False, capture=False)
May 20 20:24:03 cloud-init[1252]: util.py[WARNING]: Failed to install packages: ['puppet3', 'lvm2', 'btrfs-progs']
May 20 20:24:03 cloud-init[1252]: util.py[DEBUG]: Failed to install packages: ['puppet3', 'lvm2', 'btrfs-progs']
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/cloudinit/config/cc_package_update_upgrade_install.py", line 88, in handle
cloud.distro.install_packages(pkglist)
File "/usr/lib/python2.6/site-packages/cloudinit/distros/rhel.py", line 70, in install_packages
self.package_command('install', pkgs=pkglist)
File "/usr/lib/python2.6/site-packages/cloudinit/distros/rhel.py", line 274, in package_command
util.subp(cmd, capture=False, pipe_cat=True, close_stdin=True)
File "/usr/lib/python2.6/site-packages/cloudinit/util.py", line 1529, in subp
cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['yum', '-t', '-y', 'install', 'puppet3', 'lvm2', 'btrfs-progs']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
May 20 20:24:03 cloud-init[1252]: cc_package_update_upgrade_install.py[WARNING]: 3 failed with exceptions, re-raising the last one
May 20 20:24:03 cloud-init[1252]: util.py[WARNING]: Running package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.6/site-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed
May 20 20:24:03 cloud-init[1252]: util.py[DEBUG]: Running package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.6/site-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/cloudinit/stages.py", line 553, in _run_modules
cc.run(run_name, mod.handle, func_args, freq=freq)
File "/usr/lib/python2.6/site-packages/cloudinit/cloud.py", line 63, in run
return self._runners.run(name, functor, args, freq, clear_on_fail)
File "/usr/lib/python2.6/site-packages/cloudinit/helpers.py", line 197, in run
results = functor(*args)
File "/usr/lib/python2.6/site-packages/cloudinit/config/cc_package_update_upgrade_install.py", line 111, in handle
raise errors[-1]
ProcessExecutionError: Unexpected error while running command.
Command: ['yum', '-t', '-y', 'install', 'puppet3', 'lvm2', 'btrfs-progs']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
When I run the command yum -t -y install puppet3 lvm2 btrfs-progs manually as root, it runs just fine, but Cloud Init fails to run it on its own.
Is there something I'm doing wrong here?
Evidently there was a bug that was fixed in later versions of the image. If this is happening to you, it may be a legitimate bug in Cloud-Init or your server's implementation of it. A package upgrade may fix the problem; an update to the image will also fix it globally.
