OpenMPI runtime error: Hello World run on hosts - parallel-processing

I'm trying to set up a cluster. So far I'm testing it with only 1 master and 1 slave. Running the script from the master, it starts printing HelloWorld, but then I get the following error:
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
It keeps printing HelloWorld, and after a while:
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[62648,1],2]
Exit code: 2
Then the code stops. By chance I tried to run the script from the slave and it works. I can't figure out why.
I've set up passwordless SSH, and I'm running a file located in an NFS-mounted folder.
Can you help me?
Thanks

SOLVED: I went through all the configuration files I had modified and finally found a mistake in /etc/hosts. That explains why the program worked when launched from the node towards the master but not vice versa. As for the program stopping mid-run, it was related to the node not being able to find the file to run; I fixed it by setting up the NFS share again.
Thanks for your help, hope this can be useful to other users.
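For reference, here is a minimal sketch of the kind of /etc/hosts and hostfile setup involved; the hostnames, addresses, and paths are hypothetical placeholders, not taken from the question:
# /etc/hosts on every node -- all nodes must resolve each other consistently
#   192.168.0.10  master
#   192.168.0.11  node01

# hostfile listing the machines mpirun may use (slots = processes per node)
cat > hostfile <<'EOF'
master slots=2
node01 slots=2
EOF

# launch the NFS-shared binary across both machines
mpirun --hostfile hostfile -np 4 /nfs/shared/hello_world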

Related

How to fail Azure devops pipeline task specifically for failures in bash script

I am using an Azure DevOps pipeline, and in it there is one task that creates a KVM guest VM; once the VM is created through Packer inside the host, the task runs a bash script to check the status of the services running inside the guest VM.
If any service is not running or throws an error, this bash script will exit with code 3, as I have added the following to the script:
set -e
So I want the task to fail if the above bash script fails. The issue is that the KVM guest VM is created in the same task, so it throws expected errors while booting up and shutting down; I don't want the task to fail because of those errors, only when the bash script fails.
I have selected the "Fail on Standard Error" option in the task.
But I'm not sure how to fail the task specifically on a bash script error. Does anyone have suggestions?
You can try using the exit 1 command to have the bash task fail; it is often a command you'll issue right after an error is logged.
Additionally, you may also use logging commands to emit a customized error message. Refer to the sample below.
#!/bin/bash
echo "##vso[task.logissue type=error]Something went very wrong."
exit 1
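Building on that, here is a minimal sketch of a check that logs a pipeline error and exits non-zero only when a monitored service is down, so expected boot/shutdown noise is ignored; the service names and the exit code 3 are placeholders for whatever your script actually checks:
#!/bin/bash
# Check a list of services inside the guest VM; only this check decides the exit code.
failed=0
for svc in sshd nginx; do                       # hypothetical service names
    if ! systemctl is-active --quiet "$svc"; then
        echo "##vso[task.logissue type=error]Service $svc is not running."
        failed=1
    fi
done
[ "$failed" -eq 1 ] && exit 3                   # fail the task only for the service check
exit 0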

Boot hangs: A start job is running

In a VirtualBox VM I have a Debian system that I sometimes want to run without X. So I edited /etc/grub.d/10_linux and added another menu item with the kernel option "nox" appended. Then I added a line to /lib/systemd/system/lightdm.service, in the [Unit] section:
ConditionKernelCommandLine=!nox
However, when starting this, it hangs with the message:
A start job is running for Hold until boot process finishes up (56min / no limit)
Thank you, systemd, for informing me about that; I wouldn't have noticed. Still, I would like to know which job it is that's hanging.
The system allows me to connect via SSH, but none of the systemctl or journalctl commands I tried told me the name of the service causing the problem. lightdm.service itself seems to be satisfied.
I know it's a bit late, but I just found out that one can use:
systemctl list-jobs
to find out what units are waiting or running at any given moment.
By adding systemd.debug-shell=1 to the kernel command line, a root shell will be available on TTY9 (Ctrl+Alt+F9) to run the command above.
I first tried "systemd-analyze", and that gave me the message about "systemctl list-jobs".
Hope this helps someone with similar problems.
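A rough sketch of that debugging workflow; the GRUB edit is an assumption about how you would add the parameter for a single boot on Debian:
# At the GRUB menu, press 'e' and append to the linux line:
#   systemd.debug-shell=1

# Once the boot hangs, switch to TTY9 (Ctrl+Alt+F9) and inspect the job queue:
systemctl list-jobs        # shows which units are still "waiting" vs "running"
systemd-analyze blame      # after a successful boot, shows per-unit startup times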

error while running shell script through jenkins pipeline

I'm getting the error below while trying to run a shell script:
+ /home/pqsharma/symlinkBuild.sh 19.07
sh: line 1: 21887 Terminated sleep 3
With Jenkinsfile:
node('linux') {
    stage('creating symlink') {
        stdout = sh(script: '/home/pqsharma/symlinkBuild.sh 19.07', returnStdout: true)
    }
}
This is tracked in JENKINS-55308: "intermittent "terminated" messages using sh in Pipelines".
The Jenkins master runs from a Docker image based on jenkins/jenkins:2.138.2-alpine, with specific plugins baked into the image by /usr/local/bin/install-plugins.sh.
The message originates in durable-task-plugin, which must be a dependency of one of the plugins.txt plugins.
Check if this is the case for you.
This is caused by JENKINS-55867: "sh step termination is never detected if the wrapper process is killed".
When you execute a shell step, Jenkins runs a wrapper shell process that's responsible for saving the exit code of your script. If this process is killed, then Jenkins never discovers that your script has terminated, and the step hangs forever.
This seems to have been introduced after v1.22 of durable-task-plugin.
Diagnostic:
The sleep 3 is part of the execution of a shell step.
A background process touches a specific file on the agent every 3 seconds, and the Jenkins master checks the timestamp on that file as a proxy to know whether the script is still running or not.
It seems based on the reports here that something is causing that process to be killed on some systems, but I don't have any ideas about what it could be offhand.
Possible cause:
The bug is not just in the durable-task-plugin, although the symptoms come from there. It is introduced when you upgrade workflow-job. I have managed to pin it down to a specific version.
Upgrading workflow-job to 2.27 or later triggers the bug. (2.26 does not exist.)
So try downgrading your workflow-job plugin to 2.25.
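If your Jenkins master is likewise built from a Docker image, here is a sketch of pinning the version in plugins.txt, which /usr/local/bin/install-plugins.sh consumes at image build time; the durable-task line and any other entries are placeholders for whatever your image already lists:
# plugins.txt
workflow-job:2.25
durable-task:1.22
# ...the rest of your plugin list...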

ceph health command returns a failure

I'm new to Ceph but have to build a mini-cluster as part of a project. I have been following an online tutorial on how to build one, and all was fine until I restarted my machines the following day. Now when I run the command ceph health it returns an error saying: 2015-01-08 15:35:04.037375 7fae717fa700 0 -- :/1003525 >> 192.168.1.12:6789/0 pipe(0x7fae6c000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fae6c000e90).fault
And whenever I run the same command on the 192.168.1.12 machine, it returns an error saying: monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication.
0 librados: client.admin initialization error (2) No such file or directory. Error connecting to cluster: ObjectNotFound.
I have been searching the internet for a while now and have not found much, so any help would be greatly appreciated. I'm using CentOS 7 on all machines, if that's any help.
Check whether you have permission to read the keyring file at
/etc/ceph/ceph.client.admin.keyring
If this file is not readable by your user, or it is missing, you are not able to do
ceph -w
If the keyring is missing, you can install it from the admin node using ceph-deploy admin serverhostname
As the error says: ERROR: missing keyring. That means you don't have the keyring file.
Besides that, this error:
2015-01-08 15:35:04.037375 7fae717fa700 0 -- :/1003525 >> 192.168.1.12:6789/0 pipe(0x7fae6c000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fae6c000e90).fault
means your monitor didn't start up because you are missing the keyring file.
Steps to resolve this problem:
1. Check the monitor host and get it to start up.
2. Execute the command "ceph -s" on the monitor to check the cluster.
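A brief sketch of those checks; the monitor hostname is a hypothetical placeholder, and the exact service command depends on your Ceph release:
# On the affected node: is the admin keyring present and readable?
ls -l /etc/ceph/ceph.client.admin.keyring
sudo chmod +r /etc/ceph/ceph.client.admin.keyring    # if it exists but isn't readable

# From the admin/deploy node: push the keyring again if it is missing
ceph-deploy admin mon-node1                          # hypothetical monitor hostname

# On the monitor: make sure the mon daemon is running, then check cluster state
sudo systemctl start ceph-mon@mon-node1              # unit name varies by Ceph release
ceph -s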

start daemon on remote server via Jenkins SSH shell script exits mysteriously

I have a build job on Jenkins that builds my project; after it is done, it runs an SSH shell script against a remote server that transfers files and then stops and starts a daemon.
When I stop and start the daemon from the command line on a RHEL server, it executes just fine. When the job executes in jenkins, there are no errors.
The daemon stops fine and it starts fine. But shortly after starting, the daemon dies suddenly.
sudo service daemonName stop
# transfer files.
sudo service daemonName start
I'm sure that the problem isn't pathing.
Does anyone know what could be special about the way Jenkins is executing the ssh shell script that would cause the daemon start to not fully complete?
The problem:
When executing a build through Jenkins, the command to start the daemon was clearly executing successfully, yet after the build job was done, the daemon would suddenly quit.
The solution:
I thought this whole time that it was Jenkins killing the daemon, so I tried many different incarnations and permutations of disabling the ProcessTree module that goes through and cleans up zombie child processes. I tried fooling it by resetting the BUILD_ID environment variable. Nothing worked.
Thanks to this thread I found out that that approach only works for child processes spawned on the build machine, i.e. not applicable to my problem.
More searching led me here: Run a persistent process via ssh
The solution? Nohup.
So now the build successfully restarts the daemon by executing the following:
sudo nohup service daemonname start
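For reference, a sketch of how the remote restart might look when driven over SSH, with output detached so nothing holds the session open; the user, host, and log path are hypothetical:
ssh deploy@appserver '
  sudo service daemonName stop
  # ...transfer/refresh files here...
  sudo nohup service daemonName start > /tmp/daemonName.out 2>&1 &
'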
Jenkins watches for processes spawned by the job and kills them to avoid zombie processes.
See https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller
The workaround is to override the BUILD_ID environment variable:
BUILD_ID=dontKillMe
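For completeness, a minimal sketch of that workaround inside a freestyle job's "Execute shell" step; the daemon name comes from the question, everything else is an assumption:
# Execute shell build step
export BUILD_ID=dontKillMe        # tells the ProcessTreeKiller to leave spawned processes alone
sudo service daemonName start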
