OpenMPI: Simple 2-Node Setup - parallel-processing

I'm having trouble running an OpenMPI program using only two nodes (one of the nodes is the same machine that is executing the mpiexec command and the other node is a separate machine).
I'll call the machine that is running mpiexec, master, and the other node slave.
On both master and slave, I've installed Open MPI in my home directory under ~/mpi
I have a file called ~/machines.txt on master.
Ideally, ~/machines.txt should contain:
master
slave
However, when I run the following on master:
mpiexec -n 2 --hostfile ~/machines.txt hostname
I get the following error:
bash: orted: command not found
But if ~/machines.txt only contains the name of the node that the command is running on, it works.
~/machines.txt:
master
Command:
mpiexec -n 2 --hostfile ~/machines.txt hostname
OUTPUT:
master
master
I've tried running the same command on slave, and changed the machines.txt file to contain only slave, and it worked too. I've made sure that my .bashrc file contains the proper paths for OpenMPI.
What am I doing wrong? In short, there is only a problem when I try to execute a program on a remote machine, but I can run mpiexec perfectly fine on the machine that is executing the command. This makes me believe that it's not a path issue. Am I missing a step in connecting both machines? I have passwordless ssh login capability from master to slave.

This error message means that you either do not have Open MPI installed on the remote machine, or you do not have your PATH set properly on the remote machine for non-interactive logins (i.e., such that it can't find the installation of Open MPI on the remote machine). "orted" is one of the helper executables that Open MPI uses to launch processes on remote nodes -- so if "orted" was not found, then it didn't even get to the point of trying to launch "hostname" on the remote node.
Note that there might be a difference between interactive and non-interactive logins in your shell startup files (e.g., in your .bashrc).
Also note that it is considerably simpler to have Open MPI installed in the same path location on all nodes -- in that way, the prefix method described above will automatically add the right PATH and LD_LIBRARY_PATH when executing on the remote nodes, and you don't have to muck with your shell startup files.
Note that there are a bunch of FAQ items about these kinds of topics on the main Open MPI web site.
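One way to check what a non-interactive login actually sees is to ask the remote shell directly. The helper below is only a minimal sketch: the function name remote_which is made up for illustration, and it assumes passwordless ssh to the host, as described in the question.

```shell
#!/bin/sh
# remote_which HOST CMD (hypothetical helper, not part of Open MPI)
# Prints the path that a non-interactive ssh session on HOST resolves for CMD.
# Prints nothing and returns non-zero if CMD is not on that session's PATH.
remote_which() {
    ssh "$1" "command -v $2"
}

# Example, using the host name from the question:
#   remote_which slave orted || echo "orted is not on the non-interactive PATH"
```

If this prints nothing for orted while an interactive login on the same machine finds it, the PATH is being set somewhere that only interactive shells read.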

Either explicitly set the absolute OpenMPI prefix with the --prefix option:
prompt> mpiexec --prefix=$HOME/mpi ...
or invoke mpiexec with the absolute path to it:
prompt> $HOME/mpi/bin/mpiexec ...
The latter option sets the prefix automatically. The prefix is then used to set PATH and LD_LIBRARY_PATH on the remote machines.

This answer comes very late, but for Linux users: it is a bad habit to add environment variables at the end of the ~/.bashrc file. If you look carefully at the top of the file, you will notice an if statement that exits early when the shell is non-interactive, which is exactly the situation when your program is launched through ssh on the remote host. So put your environment variables at the TOP of the file, before that if statement.
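The effect of that early exit can be reproduced locally without touching the real ~/.bashrc. The sketch below writes a throwaway rc file containing a guard of the kind commonly shipped on Debian/Ubuntu (the exact form varies between distributions; some test $PS1 instead of $-), then sources it from a non-interactive shell. The variable names DEMO_BEFORE_GUARD and DEMO_AFTER_GUARD are invented for the demonstration.

```shell
#!/bin/sh
# Write a sample rc file to a temporary location (not the real ~/.bashrc).
rc=$(mktemp)
cat > "$rc" <<'EOF'
# Exports placed BEFORE the guard are seen by non-interactive shells too.
export DEMO_BEFORE_GUARD=yes
# Guard: stop reading this file when the shell is not interactive.
case $- in
    *i*) ;;
      *) return ;;
esac
# Anything below the guard is skipped by non-interactive shells.
export DEMO_AFTER_GUARD=yes
EOF

# Source the file from a non-interactive shell and report both variables.
result=$(sh -c '. "$1"; echo "before=${DEMO_BEFORE_GUARD:-unset} after=${DEMO_AFTER_GUARD:-unset}"' sh "$rc")
echo "$result"
rm -f "$rc"
```

Only the variable exported before the guard survives, which is why PATH and LD_LIBRARY_PATH settings for Open MPI need to sit above it.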

Try editing the file /etc/environment so that it includes the Open MPI directories, for example:
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/hadoop/openmpi_install/bin"
LD_LIBRARY_PATH=/home/hadoop/openmpi_install/lib

Related

Shell Script Issue Running Command Remotely using SSH

I have a deploy script in which I want to clear the cache of my CDN. When I am on the server and run my script everything is fine. However, when I SSH in and run only that file (i.e. not actually getting into the server, cding into the directory and running it), it fails and states that my doctl command cannot be found. This seems to only be an issue with this program over ssh; running systemctl --help works fine.
Please note that I have installed Digital Ocean's doctl using sudo snap install doctl and it is there.
Here is the .sh file (minus comments):
#!/bin/sh
doctl compute cdn flush [MYID] --files [*] # static cache
So I am not sure what the issue is. Anybody have an idea?
Again, if I get into the server and run the file all works, but here is the SSH command I use that returns the error:
ssh root@123.45.678.999 "/deploy/clear_digital_ocean_cache.sh"
And here is the error.
/deploy/clear_digital_ocean_cache.sh: 10: doctl: not found
Well one solution was to change the command to be an absolute path inside my .sh file like so:
#!/bin/sh
/snap/bin/doctl compute cdn flush [MYID] --files [*] # static cache
I realized that I could run my user commands with ssh (like systemctl) so it was either change where doctl was located (i.e. in the user bin) or ensure that the command was called with an absolute path adding the /snap/bin/ in front of the command.
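An alternative to hard-coding the absolute path on every call is to extend PATH once at the top of the script. This is a sketch under the assumption that snap placed doctl in /snap/bin, as the absolute path above suggests; prepend_path is a hypothetical helper name.

```shell
#!/bin/sh
# Prepend DIR to PATH unless it is already listed there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Assumption: snap-installed binaries live in /snap/bin on this server.
prepend_path /snap/bin

# From here on, commands such as doctl resolve even when the script is
# started through a non-interactive ssh session.
```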

unable to start a job using spark-submit via ssh (on EC2)

I set up spark on a single EC2 machine and, when I am connected to it, I am able to use spark either with jupyter or spark-submit, without any issue. Unfortunately, though, I am not able to use spark-submit via ssh.
So, to recap:
This works:
ubuntu@ip-198-43-52-121:~$ spark-submit job.py
This does not work:
ssh -i file.pem ubuntu@blablablba.compute.amazon.com "spark-submit job.py"
Initially, I kept getting the following error message over and over:
'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
After having read many articles and posts about this issue, I thought that the problem was due to some variables not having been set properly, so I added the following lines to the machine's .bashrc file:
export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7 #(it's where i unzipped the spark file)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=/usr/bin/python3
export PYSPARK_PYTHON=python3
(As the error message referenced python, I also tried adding the line "alias python=python3" to .bashrc, but nothing changed)
After all this, if I try to submit the spark job via ssh I get the following error message:
"command spark-submit not found".
As it looks like the system ignores all the environment variables when sending commands via SSH, I decided to source the machine's .bashrc file before trying to run the spark job. As I was not sure about the most appropriate way to send multiple commands via SSH, I tried all the following ways:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file"
ssh -i file.pem ubuntu@blabla.compute.amazon.com << HERE
source .bashrc
spark-submit job.file
HERE
ssh -i file.pem ubuntu@blabla.compute.amazon.com <<- HERE
source .bashrc
spark-submit job.file
HERE
(ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file")
All attempts worked with other commands like ls or mkdir, but not with source and spark-submit.
I have also tried providing the full path running the following line:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"
In this case too I get, once again, the following message:
'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
How can I tell spark which python to use if SSH seems to ignore all environment variables, no matter how many times I set them?
It's worth mentioning I have got into coding and data a bit more than a year ago, so I am really a newbie here and any help would be highly appreciated. The solution may be very simple, but I cannot get my head around it. Please help.
Thanks a lot in advance :)
The problem was indeed with the way I was expecting the shell to work (which was wrong).
My issue was solved by:
Setting my variables in .profile instead of .bashrc
Providing full path to python
Now I can launch spark jobs via ssh.
I found the solution in the answer @VinkoVrsalovic gave to this post:
Why does an SSH remote command get fewer environment variables then when run manually?
Cheers
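Because ~/.profile is read by login shells, another way to pick up those variables is to ask the remote side to start a login shell explicitly. The wrapper below is a sketch, not something from the original post: ssh_login_run is a hypothetical name, and it assumes bash is available on the remote machine. A bash login shell reads the first of ~/.bash_profile, ~/.bash_login and ~/.profile that exists.

```shell
#!/bin/sh
# ssh_login_run CMD [SSH_ARGS...] (hypothetical wrapper)
# Sends CMD on stdin to a remote bash login shell, so the remote profile
# file is read before CMD runs. Passing CMD on stdin avoids nested quoting.
ssh_login_run() {
    cmd=$1
    shift
    printf '%s\n' "$cmd" | ssh "$@" bash -l -s
}

# Example with the (placeholder) host from the question:
#   ssh_login_run 'spark-submit job.py' -i file.pem ubuntu@blabla.compute.amazon.com
```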

Get the remote bash shell to use a .bash_history file that is on my local machine

My environment contains clusters with multiple hosts in each cluster and as such I tend to run similar or equivalent commands on the hosts in such a cluster.
Sometimes, I am ssh-ed into a cluster host and remember that I had run a certain command on another host in this cluster but I can't remember which host I ran it on, however I need to run that command again.
Since every host in the cluster has its own .bash_history, I have to log in to each and every one of them and look through the .bash_history file to locate that command.
However, if I could use one .bash_history file for all hosts in the cluster (e.g. named .bash_history.clusterX) then I would be able to search the command in the bash history (with CTRL+R) and execute it.
Is that possible?
In my setup shared home directory (via nfs, etc.) is not an option.
Another approach is to leave the relevant commands to execute in an executable file ('ssh_commands') in the home folder of each remote user on each machine.
Those ssh_commands will include the commands you need to execute on each server whenever you open an SSH session.
To call that file on each SSH session:
ssh remoteUser@remoteServer -t "/bin/bash --init-file <(echo 'source ssh_commands')"
That way, you don't have to look for the right commands to execute, locally or remotely: your SSH session opens and executes right away what you want.

How to include a sub-script in a remote shell from remote location?

I am running a local bootstrap.sh script from OSX on a remote Ubuntu server which does some "if else then" stuff to load a specific subscript.sh when a specific condition is met.
I am running that local script with:
ssh user@host "bash -s" <~/projects/projectname/bootstrap.sh
I am having issues with getting the subscript.sh sourced (loaded/included).
You can't. You're only sending the contents of bootstrap.sh to the remote shell. It's attempting to source subscript.sh on the remote machine, and it isn't there.
You'll need to either copy subscript.sh (or both scripts!) to the remote machine, or insert the contents of subscript.sh into bootstrap.sh in place of the source command.
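The second option, inserting the contents of subscript.sh in place of the source command, can be sketched with a small awk filter run locally before the script is piped to ssh. The function name inline_source is hypothetical, and the sketch only handles a source (or .) command that sits on a line of its own with an unquoted file name.

```shell
#!/bin/sh
# inline_source BOOTSTRAP SUBSCRIPT (hypothetical helper)
# Prints BOOTSTRAP with every line of the form "source SUBSCRIPT" or
# ". SUBSCRIPT" replaced by the contents of SUBSCRIPT.
inline_source() {
    awk -v sub_file="$2" -v name="$(basename "$2")" '
        $1 == "source" || $1 == "." {
            # Compare only the last path component of the sourced file.
            n = split($2, parts, "/")
            if (parts[n] == name) {
                while ((getline line < sub_file) > 0) print line
                close(sub_file)
                next
            }
        }
        { print }
    ' "$1"
}

# Example with the names from the question:
#   inline_source bootstrap.sh subscript.sh | ssh user@host "bash -s"
```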
What I would recommend is to rsync your 'bootstrap.sh' from your local computer to your server. You should be able to do this with your ssh credentials.
A very cool utility is Transmit. It is $25 and allows you to cleanly mount your server as if it were a portable hard drive (Transmit can also do synchronizations). All you need is ssh credentials and is very user friendly.
If you are allowed to install on your server, then I would install qsub on it. (Actually just check to see if it is installed.) Then just mount your computer's drive and you can submit scripts with qsub (I actually would just make a small server on your mac). This is what I use for using a linux cluster from my OSX computer.
Alternatively you can make a small server from your osx and have it mounted on your linux server.

cp command fails when run in a script called by Hudson

This one is a puzzler. If I run a command from the command line to copy a file remotely it works perfectly. If I run that same command inside a script on the server (that hosts Hudson), it runs perfectly as well, same for running the job as hudson from the command line. However, if I run that exact command as a function inside a bash script from a Hudson job, it fails with:
cp: cannot stat '/opt/flash_board.tar.gz': No such file or directory
The variable is defined as:
original_tarball=flash_board.tar.gz
and is in scope (variable expansion works correctly in the script).
The original command is:
ssh -n -o stricthostkeychecking=no root@$IP_ADDRESS ssh -n -o stricthostkeychecking=no 169.254.0.2 cp /opt/$original_tarball /opt/$original_tarball.bak
I've also tried it as:
ssh -n -p 1601 -o stricthostkeychecking=no root@$IP_ADDRESS cp /opt/$original_tarball /opt/$original_tarball.bak
which points to the correct port, but fails in exactly the same way.
For reference all the variables have been checked to be valid. I originally thought this was a substitution error, but that doesn't seem to be the case, so then I tried running it with Hudson credentials as:
sudo -u hudson ssh -n -o stricthostkeychecking=no root@$IP_ADDRESS ssh -n -o stricthostkeychecking=no 169.254.0.2 cp /opt/$original_tarball /opt/$original_tarball.bak
I get the exact same results (it works). So it's only when this command is run from a Hudson job that it fails.
Here's the sequence of events:
Hudson job sets parameters & calls a shell script.
A function inside the script tries to copy the files remotely from an embedded Montevista (Linux) board across an SPI bus to a second embedded Arago (Linux) board
Both boards are physically on the same mother board, but there's no way to directly access the Arago board except through a serial console session (which isn't feasible, this is an automation job that runs across the network).
I've tried this using ssh with -p 1601 (the correct port to the Arago side).
Can I use scp to copy a remote file to the same location as the remote file with a different file extension?
Something like:
scp -o stricthostkeychecking=no root@$IP_ADDRESS /opt/$original_tarball /opt/$original_tarball.bak
I had a couple of the devs take a look at this and they were stumped as well. Anyone got any ideas (A) why this fails & (B) how to work around it. I'm pretty sure I can write a script to run locally on the remote machine, but that doesn't seem like it should be necessary.
Oh, and if I run the exact same command on the Montevista board (which means I don't have to go across the SPI bus (169.254.0.2), it works perfectly from the Hudson job.
So, this turned out to be something completely unrelated to the question. I broke the problem down into little pieces with a test Hudson script, adding more and more complexity from the original script till it failed as before.
It turned out to be pilot error, I'd written an if statement to differentiate between the two boards (Arago & Montevista) and then abstracted out the variables passed to the if statement to the point where it was ambiguous which board was being passed in, so the if logic always grabbed the first match (as it should) and the flash script I was trying to copy on the Arago board didn't exist on the Montevista board (well, it has a different name) so the error returned was absolutely correct.
Sorry for the spin up and thanks for all the effort to help.
cp: cannot stat '/opt/flash_board.tar.gz': No such file or directory
This is saying that Hudson cannot see the file. I would do a ls -la /opt in that shell script of yours. This will show you the permissions on the /opt directory, and whether your script can list that file.
While you're at it, run df /opt on the Hudson machine too and see if that /opt directory is a remote mount or something that could be problematic.
You've already said that you logged in as the user that runs the Hudson task and execute it from the workspace directory.
Right now, I suspect that the directory permission is an issue.
The obvious way that goes wrong is that somehow it is being run on the wrong machine, possibly due to either a line length limit, or to weird quoting issues.
I'd try changing the command to … uname -a or … hostname -f to see if you get the right machine. Or, alternatively, … cp /proc/cpuinfo /tmp/this-machine and then see which machine gets the file.
edit: I see now that OP has answered his own question. I guess I'll leave this here in case it helps any future visitors with similar issues. I guess I should add "or not running the command you think you're running" to the reasons why it could happen.
