unable to start a job using spark-submit via ssh (on EC2) - bash

I set up spark on a single EC2 machine and, when I am connected to it, I am able to use spark either with jupyter or spark-submit, without any issue. Unfortunately, though, I am not able to use spark-submit via ssh.
So, to recap:
This works:
ubuntu@ip-198-43-52-121:~$ spark-submit job.py
This does not work:
ssh -i file.pem ubuntu@blablablba.compute.amazon.com "spark-submit job.py"
Initially, I kept getting the following error message over and over:
'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
After having read many articles and posts about this issue, I thought that the problem was due to some variables not having been set properly, so I added the following lines to the machine's .bashrc file:
export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7 #(it's where i unzipped the spark file)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=/usr/bin/python3
export PYSPARK_PYTHON=python3
(As the error message referenced python, I also tried adding the line "alias python=python3" to .bashrc, but nothing changed)
After all this, if I try to submit the spark job via ssh I get the following error message:
"command spark-submit not found".
As it looks like the system ignores all the environment variables when sending commands via SSH, I decided to source the machine's .bashrc file before trying to run the spark job. As I was not sure about the most appropriate way to send multiple commands via SSH, I tried all the following ways:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file"
ssh -i file.pem ubuntu@blabla.compute.amazon.com << HERE
source .bashrc
spark-submit job.file
HERE
ssh -i file.pem ubuntu@blabla.compute.amazon.com <<- HERE
source .bashrc
spark-submit job.file
HERE
(ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file")
All attempts worked with other commands like ls or mkdir, but not with source and spark-submit.
I have also tried providing the full path running the following line:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"
In this case too I get, once again, the following message:
'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
How can I tell spark which python to use if SSH seems to ignore all environment variables, no matter how many times I set them?
It's worth mentioning I have got into coding and data a bit more than a year ago, so I am really a newbie here and any help would be highly appreciated. The solution may be very simple, but I cannot get my head around it. Please help.
Thanks a lot in advance :)

The problem was indeed with the way I was expecting the shell to work (which was wrong).
My issue was solved by:
Setting my variables in .profile instead of .bashrc
Providing full path to python
Now I can launch spark jobs via ssh.
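As a rough sketch of the "full path" part of this fix, the remote command line can carry everything it needs, so it does not depend on which startup files the remote shell reads. The helper name build_spark_cmd is hypothetical, the key file and host are the placeholders from the question, and the sketch assumes none of the paths contain spaces or quotes:

```shell
# Hypothetical helper: prints a remote command line that sets PYSPARK_PYTHON
# inline and calls spark-submit by its absolute path.
# Arguments: SPARK_HOME on the remote host, Python interpreter path, job file.
# Assumption: none of the arguments contain spaces or quotes.
build_spark_cmd() {
    printf 'PYSPARK_PYTHON=%s %s/bin/spark-submit %s\n' "$2" "$1" "$3"
}

remote_cmd=$(build_spark_cmd /home/ubuntu/spark-3.0.1-bin-hadoop2.7 /usr/bin/python3 job.py)
echo "$remote_cmd"

# Then, with the placeholder key and host from the question:
# ssh -i file.pem ubuntu@blabla.compute.amazon.com "$remote_cmd"
```

Because the variable assignment is part of the command itself, it should be visible to spark-submit whether or not .bashrc or .profile was read.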
I found the solution in the answer @VinkoVrsalovic gave to this post:
Why does an SSH remote command get fewer environment variables then when run manually?
Cheers

Related

Shell Script Issue Running Command Remotely using SSH

I have a deploy script in which I want to clear the cache of my CDN. When I am on the server and run my script everything is fine; however, when I run only that file over SSH (i.e. without actually logging into the server, cd-ing into the directory and running it), it fails and states that my doctl command cannot be found. This seems to be an issue only with this program over ssh; running systemctl --help works fine.
Please note that I have installed Digital Ocean's doctl using sudo snap install doctl and it is there.
Here is the .sh file (minus comments):
#!/bin/sh
doctl compute cdn flush [MYID] --files [*] # static cache
So I am not sure what the issue is. Anybody have an idea?
Again, if I get into the server and run the file all works, but here is the SSH command I use that returns the error:
ssh root@123.45.678.999 "/deploy/clear_digital_ocean_cache.sh"
And here is the error.
/deploy/clear_digital_ocean_cache.sh: 10: doctl: not found
Well one solution was to change the command to be an absolute path inside my .sh file like so:
#!/bin/sh
/snap/bin/doctl compute cdn flush [MYID] --files [*] # static cache
I realized that I could run my user commands with ssh (like systemctl) so it was either change where doctl was located (i.e. in the user bin) or ensure that the command was called with an absolute path adding the /snap/bin/ in front of the command.
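Another way to get the same effect, sketched here under assumptions, is to have the script extend PATH itself before calling doctl. The helper name prepend_path is hypothetical; /snap/bin is the directory mentioned above:

```shell
# prepend_path DIR PATHVALUE: print PATHVALUE with DIR put in front,
# unless DIR is already one of its entries.
prepend_path() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)        printf '%s\n' "$1:$2" ;;
    esac
}

# In the deploy script, before the first doctl call:
PATH=$(prepend_path /snap/bin "$PATH")
export PATH
```

With this at the top of the script, a plain doctl call would be looked up in /snap/bin even when the SSH session starts with a shorter PATH than an interactive login.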

How to know what initial commands being executed right after a SSH login?

I was provided a tool to do a SSH to a remote host. The remote host is a new docker to be created. I was trying to understand if there are commands being executed right after the SSH (i.e. probably using ssh -t <some commands>).
It seems that .bash_history does not include those commands. In that case, what else can I do to figure out which commands are executed right after my login? Thank you.
To find out the actual commands that are executed, you could add "set -v" or "set -x" to the shell initialization file(s) on the system you are ssh-ing to.
See man bash (the "INVOCATION" section) to find out which files will be executed so that you can figure out which file to add the "set" command to.
You will probably want to do that temporarily ... because the output is verbose.
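To get a feel for what "set -x" produces before touching any startup file, here is a small local sketch. The helper name trace_snippet is hypothetical; it runs a snippet in a child sh with tracing enabled, which is the same mechanism as putting "set -x" at the top of an initialization file:

```shell
# trace_snippet SNIPPET: run SNIPPET in a child sh with xtrace enabled and
# merge the trace (which the shell writes to stderr) into stdout.
trace_snippet() {
    sh -x -c "$1" 2>&1
}

trace_output=$(trace_snippet 'greeting=hello; echo "$greeting"')
printf '%s\n' "$trace_output"
```

Each traced command appears on its own line with a "+ " prefix (the default PS4), mixed in with the normal output of the commands.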
Another approach would be to configure sshd to set the logging level to DEBUG and see what commands are requested. However, note that sshd DEBUG logging is a user privacy violation.
If you are trying to do this kind of stuff to find out what is happening on the first "boot" of a docker instance, try putting the (temporary) config changes into the docker image that you are starting.
The bash history only contains command lines that are submitted to the shell via a shell command prompt.

OpenMPI: Simple 2-Node Setup

I'm having trouble running an OpenMPI program using only two nodes (one of the nodes is the same machine that is executing the mpiexec command and the other node is a separate machine).
I'll call the machine that is running mpiexec, master, and the other node slave.
On both master and slave, I've installed OpenMPI in my home directory under ~/mpi
I have a file called ~/machines.txt on master.
Ideally, ~/machines.txt should contain:
master
slave
However, when I run the following on master:
mpiexec -n 2 --hostfile ~/machines.txt hostname
I get the following error:
bash: orted: command not found
But if ~/machines.txt only contains the name of the node that the command is running on, it works.
~/machines.txt:
master
Command:
mpiexec -n 2 --hostfile ~/machines.txt hostname
OUTPUT:
master
master
I've tried running the same command on slave, and changed the machines.txt file to contain only slave, and it worked too. I've made sure that my .bashrc file contains the proper paths for OpenMPI.
What am I doing wrong? In short, there is only a problem when I try to execute a program on a remote machine, but I can run mpiexec perfectly fine on the machine that is executing the command. This makes me believe that it's not a path issue. Am I missing a step in connecting both machines? I have passwordless ssh login capability from master to slave.
This error message means that you either do not have Open MPI installed on the remote machine, or you do not have your PATH set properly on the remote machine for non-interactive logins (i.e., such that it can't find the installation of Open MPI on the remote machine). "orted" is one of the helper executables that Open MPI uses to launch processes on remote nodes -- so if "orted" was not found, then it didn't even get to the point of trying to launch "hostname" on the remote node.
Note that there might be a difference between interactive and non-interactive logins in your shell startup files (e.g., in your .bashrc).
Also note that it is considerably simpler to have Open MPI installed in the same path location on all nodes -- in that way, the prefix method described above will automatically add the right PATH and LD_LIBRARY_PATH when executing on the remote nodes, and you don't have to muck with your shell startup files.
Note that there are a bunch of FAQ items about these kinds of topics on the main Open MPI web site.
Either explicitly set the absolute OpenMPI prefix with the --prefix option:
prompt> mpiexec --prefix=$HOME/mpi ...
or invoke mpiexec with the absolute path to it:
prompt> $HOME/mpi/bin/mpiexec ...
The latter option sets the prefix automatically. The prefix is then used to set PATH and LD_LIBRARY_PATH on the remote machines.
This answer comes very late, but for Linux users it is a bad habit to add environment variables at the end of the ~/.bashrc file: if you look carefully at the top, you will notice an if statement that exits when the shell is non-interactive, which is precisely the case when you run your program through the ssh host. So put your environment variables at the TOP of the file, before this exiting if.
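The effect of that guard can be reproduced locally. The sketch below writes a stand-in startup file (not your real ~/.bashrc) containing a guard of the kind found in stock Debian/Ubuntu .bashrc files, then sources it from a non-interactive shell. The variable names EARLY and LATE are made up for the demonstration:

```shell
# Stand-in for ~/.bashrc: one variable before the guard, one after it.
demo_rc=$(mktemp)
cat > "$demo_rc" <<'EOF'
EARLY=set-before-guard
# Guard: stop reading this file when the shell is not interactive.
case $- in
    *i*) ;;
    *) return ;;
esac
LATE=set-after-guard
EOF

# Source the file in a non-interactive shell and report both variables.
result=$(sh -c '. "$1"; printf "%s/%s" "${EARLY:-unset}" "${LATE:-unset}"' sh "$demo_rc")
rm -f "$demo_rc"
echo "$result"
```

Only the assignment placed before the guard survives, which is why PATH and LD_LIBRARY_PATH exports for Open MPI need to sit above it if they are kept in .bashrc at all.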
Try editing the file
/etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/hadoop/openmpi_install/bin"
LD_LIBRARY_PATH=/home/hadoop/openmpi_install/lib

cp command fails when run in a script called by Hudson

This one is a puzzler. If I run a command from the command line to copy a file remotely it works perfectly. If I run that same command inside a script on the server (that hosts Hudson), it runs perfectly as well, same for running the job as hudson from the command line. However, if I run that exact command as a function inside a bash script from a Hudson job, it fails with:
cp: cannot stat '/opt/flash_board.tar.gz': No such file or directory
The variable is defined as:
original_tarball=flash_board.tar.gz
and is in scope (variable expansion works correctly in the script).
The original command is:
ssh -n -o stricthostkeychecking=no root@$IP_ADDRESS ssh -n -o stricthostkeychecking=no 169.254.0.2 cp /opt/$original_tarball /opt/$original_tarball.bak
I've also tried it as:
ssh -n -p 1601 -o stricthostkeychecking=no root@$IP_ADDRESS cp /opt/$original_tarball /opt/$original_tarball.bak
which points to the correct port, but fails in exactly the same way.
For reference all the variables have been checked to be valid. I originally thought this was a substitution error, but that doesn't seem to be the case, so then I tried running it with Hudson credentials as:
sudo -u hudson ssh -n -o stricthostkeychecking=no root@$IP_ADDRESS ssh -n -o stricthostkeychecking=no 169.254.0.2 cp /opt/$original_tarball /opt/$original_tarball.bak
I get the exact same results (it works). So it's only when this command is run from a Hudson job that it fails.
Here's the sequence of events:
Hudson job sets parameters & calls a shell script.
A function inside the script tries to copy the files remotely from an embedded Montevista (Linux) board across an SPI bus to a second embedded Arago (Linux) board
Both boards are physically on the same mother board, but there's no way to directly access the Arago board except through a serial console session (which isn't feasible, this is an automation job that runs across the network).
I've tried this using ssh with -p 1601 (the correct port to the Arago side).
Can I use scp to copy a remote file to the same location as the remote file with a different file extension?
Something like:
scp -o stricthostkeychecking=no root@$IP_ADDRESS /opt/$original_tarball /opt/$original_tarball.bak
I had a couple of the devs take a look at this and they were stumped as well. Anyone got any ideas (A) why this fails & (B) how to work around it. I'm pretty sure I can write a script to run locally on the remote machine, but that doesn't seem like it should be necessary.
Oh, and if I run the exact same command on the Montevista board (which means I don't have to go across the SPI bus (169.254.0.2), it works perfectly from the Hudson job.
So, this turned out to be something completely unrelated to the question. I broke the problem down into little pieces with a test Hudson script, adding more and more complexity from the original script till it failed as before.
It turned out to be pilot error, I'd written an if statement to differentiate between the two boards (Arago & Montevista) and then abstracted out the variables passed to the if statement to the point where it was ambiguous which board was being passed in, so the if logic always grabbed the first match (as it should) and the flash script I was trying to copy on the Arago board didn't exist on the Montevista board (well, it has a different name) so the error returned was absolutely correct.
Sorry for the spin up and thanks for all the effort to help.
cp: cannot stat '/opt/flash_board.tar.gz': No such file or directory
This is saying that Hudson cannot see the file. I would do a ls -la /opt in that shell script of yours. This will show you the permissions on the /opt directory, and whether your script can list that file.
While you're at it, do a du -f on the Hudson machine too and see if that /opt directory is a remote mount or something that could be problematic.
You've already said that you logged in as the user that runs the Hudson task and execute it from the workspace directory.
Right now, I suspect that the directory permission is an issue.
The obvious way that goes wrong is that somehow it is being run on the wrong machine, possibly due to either a line length limit, or to weird quoting issues.
I'd try changing the command to … uname -a or … hostname -f to see if you get the right machine. Or, alternatively, … cp /proc/cpuinfo /tmp/this-machine and then see which machine gets the file.
edit: I see now that OP has answered his own question. I guess I'll leave this here in case it helps any future visitors with similar issues. I guess I should add "or not running the command you think you're running" to the reasons why it could happen.

How can I automate running commands remotely over SSH to multiple servers in parallel?

I've searched around a bit for similar questions, but found nothing other than running one command, or perhaps a few commands, with items such as:
ssh user@host -t sudo su -
However, what if I essentially need to run a script on (let's say) 15 servers at once. Is this doable in bash? In a perfect world I need to avoid installing applications if at all possible to pull this off. For argument's sake, let's just say that I need to do the following across 10 hosts:
Deploy a new Tomcat container
Deploy an application in the container, and configure it
Configure an Apache vhost
Reload Apache
I have a script that does all of that, but it relies on me logging into all the servers, pulling a script down from a repo, and then running it. If this isn't doable in bash, what alternatives do you suggest? Do I need a bigger hammer, such as Perl (Python might be preferred since I can guarantee Python is on all boxes in a RHEL environment thanks to yum/up2date)? If anyone can point to me to any useful information it'd be greatly appreciated, especially if it's doable in bash. I'll settle for Perl or Python, but I just don't know those as well (working on that). Thanks!
You can run a local script as shown by che and Yang, and/or you can use a Here document:
ssh root@server /bin/sh <<\EOF
wget http://server/warfile # Could use NFS here
cp app.war /location
command 1
command 2
/etc/init.d/httpd restart
EOF
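The backslash in <<\EOF matters: it quotes the delimiter, so variables inside the here document are left for the remote shell to expand instead of being expanded locally before ssh ever runs. A local sketch of the difference, using a child sh as a stand-in for the remote shell (the variable name is made up):

```shell
name=local-value

# Unquoted delimiter: $name is expanded by the local shell before the
# text is handed to the child shell.
unquoted=$(sh <<EOF
name=remote-value
echo "$name"
EOF
)

# Quoted delimiter: the text is passed through untouched, so the child
# shell expands $name itself.
quoted=$(sh <<\EOF
name=remote-value
echo "$name"
EOF
)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

With ssh in place of the child sh, the unquoted form would send the local machine's values to the server, which is rarely what a deploy script wants.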
Often, I'll just use the original Tcl version of Expect. You only need to have that on the local machine. If I'm inside a program using Perl, I do this with Net::SSH::Expect. Other languages have similar "expect" tools.
The issue of how to run commands on many servers at once came up on a Perl mailing list the other day and I'll give the same recommendation I gave there, which is to use gsh:
http://outflux.net/unix/software/gsh
gsh is similar to the "for box in box1_name box2_name box3_name" solution already given but I find gsh to be more convenient. You set up a /etc/ghosts file containing your servers in groups such as web, db, RHEL4, x86_64, or whatever (man ghosts) then you use that group when you call gsh.
[pdurbin@beamish ~]$ gsh web "cat /etc/redhat-release; uname -r"
www-2.foo.com: Red Hat Enterprise Linux AS release 4 (Nahant Update 7)
www-2.foo.com: 2.6.9-78.0.1.ELsmp
www-3.foo.com: Red Hat Enterprise Linux AS release 4 (Nahant Update 7)
www-3.foo.com: 2.6.9-78.0.1.ELsmp
www-4.foo.com: Red Hat Enterprise Linux Server release 5.2 (Tikanga)
www-4.foo.com: 2.6.18-92.1.13.el5
www-5.foo.com: Red Hat Enterprise Linux Server release 5.2 (Tikanga)
www-5.foo.com: 2.6.18-92.1.13.el5
[pdurbin@beamish ~]$
You can also combine or split ghost groups, using web+db or web-RHEL4, for example.
I'll also mention that while I have never used shmux, its website contains a list of software (including gsh) that lets you run commands on many servers at once. Capistrano has already been mentioned and (from what I understand) could be on that list as well.
Take a look at Expect (man expect)
I've accomplished similar tasks in the past using Expect.
You can pipe the local script to the remote server and execute it with one command:
ssh -t user@host 'sh' < path_to_script
This can be further automated by using public key authentication and wrapping with scripts to perform parallel execution.
You can try paramiko. It's a pure-python ssh client. You can program your ssh sessions. Nothing to install on remote machines.
See this great article on how to use it.
To give you the structure, without actual code.
Use scp to copy your install/setup script to the target box.
Use ssh to invoke your script on the remote box.
pssh may be interesting since, unlike most solutions mentioned here, the commands are run in parallel.
(For my own use, I wrote a simpler small script very similar to GavinCattell's one, it is documented here - in french).
Have you looked at things like Puppet or Cfengine? They can do what you want and probably much more.
For those that stumble across this question, I'll include an answer that uses Fabric, which solves exactly the problem described above: Running arbitrary commands on multiple hosts over ssh.
Once fabric is installed, you'd create a fabfile.py, and implement tasks that can be run on your remote hosts. For example, a task to Reload Apache might look like this:
from fabric.api import env, run
env.hosts = ['host1@example.com', 'host2@example.com']
def reload():
    """ Reload Apache """
    run("sudo /etc/init.d/apache2 reload")
Then, on your local machine, run fab reload and the sudo /etc/init.d/apache2 reload command would get run on all the hosts specified in env.hosts.
You can do it the same way you did before, just script it instead of doing it manually. The following code remotes to machine named 'loca' and runs two commands there. What you need to do is simply insert commands you want to run there.
che#ovecka ~ $ ssh loca 'uname -a; echo something_else'
Linux loca 2.6.25.9 #1 (blahblahblah)
something_else
Then, to iterate through all the machines, do something like:
for box in box1_name box2_name box3_name
do
    ssh "$box" 'commands_to_run_everywhere'
done
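The loop above visits the machines one after another. A minimal sketch of a parallel variant, using only background jobs and wait, could look like the following. The function name run_parallel and its argument order are assumptions made for this example; the runner is passed in so that ssh can be swapped for something else when trying it out, and host names are assumed not to contain a slash:

```shell
# run_parallel RUNNER CMD HOST...
# Start "RUNNER HOST CMD" for every host in the background, prefix each
# output line with the host name, then wait for all of them to finish.
run_parallel() {
    runner=$1
    cmd=$2
    shift 2
    for host in "$@"; do
        "$runner" "$host" "$cmd" </dev/null 2>&1 | sed "s/^/$host: /" &
    done
    wait
}

# Usage with real machines (box names as in the loop above):
# run_parallel ssh 'uname -a' box1_name box2_name box3_name
```

Output lines from different hosts may interleave, which is why each line carries its host prefix.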
In order to make this ssh thing work without entering passwords all the time, you'll need to set up key authentication. You can read about it at IBM developerworks.
You can run the same command on several servers at once with a tool like cluster ssh. The link is to a discussion of cluster ssh on the Debian package of the day blog.
Well, for step 1 and 2 isn't there a tomcat manager web interface; you could script that with curl or zsh with the libwww plug in.
For SSH you're looking to:
1) not get prompted for a password (use keys)
2) pass the command(s) on SSH's commandline, this is similar to rsh in a trusted network.
Other posts have shown you what to do, and I'd probably use sh too but I'd be tempted to use perl like ssh tomcatuser@server perl -e 'do-everything-on-one-line;' or you could do this:
either scp the_package.tbz tomcatuser@server:the_place/.
ssh tomcatuser@server /bin/sh <<\EOF
define stuff like TOMCAT_WEBAPPS=/usr/local/share/tomcat/webapps
tar xj the_package.tbz or rsync rsync://repository/the_package_place
mv $TOMCAT_WEBAPPS/old_war $TOMCAT_WEBAPPS/old_war.old
mv $THE_PLACE/new_war $TOMCAT_WEBAPPS/new_war
touch $TOMCAT_WEBAPPS/new_war [you don't normally have to restart tomcat]
mv $THE_PLACE/vhost_file $APACHE_VHOST_DIR/vhost_file
$APACHECTL restart [might need to login as apache user to move that file and restart]
EOF
You want DSH or distributed shell, which is used in clusters a lot. Here is the link: dsh
You basically have node groups (a file with lists of nodes in them) and you specify which node group you wish to run commands on then you would use dsh, like you would ssh to run commands on them.
dsh -a /path/to/some/command/or/script
It will run the command on all the machines at the same time and return the output prefixed with the hostname. The command or script has to be present on the system, so a shared NFS directory can be useful for these sorts of things.
Creates hostname ssh command of all machines accessed.
by Quierati
http://pastebin.com/pddEQWq2
#Use in .bashrc
#Use "HashKnownHosts no" in ~/.ssh/config or /etc/ssh/ssh_config
# If known_hosts is encrypted and delete known_hosts
[ ! -d ~/bin ] && mkdir ~/bin
for host in `cut -d, -f1 ~/.ssh/known_hosts|cut -f1 -d " "`;
do
[ ! -s ~/bin/$host ] && echo ssh $host '$*' > ~/bin/$host
done
[ -d ~/bin ] && chmod -R 700 ~/bin
export PATH=$PATH:~/bin
Example:
$ for i in hostname{1..10}; do $i who; done
There is a tool called FLATT (FLexible Automation and Troubleshooting Tool) that allows you to execute scripts on multiple Unix/Linux hosts with a click of a button. It is a desktop GUI app that runs on Mac and Windows but there is also a command line java client.
You can create batch jobs and reuse on multiple hosts.
Requires Java 1.6 or higher.
Although it's a complex topic, I can highly recommend Capistrano.
I'm not sure if this method will work for everything that you want, but you can try something like this:
$ cat your_script.sh | ssh your_host bash
Which will run the script (which resides locally) on the remote server.
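If the script needs arguments, one option is the shell's -s flag, which makes it read the script from standard input and treat the remaining words as positional parameters. The sketch below shows the idea locally, with a child sh standing in for the remote shell; the one-line script and the argument "staging" are made up for illustration:

```shell
# Stand-in for the contents of your_script.sh.
script='echo "deploying to $1"'

# Local equivalent of: ssh your_host "sh -s staging" < your_script.sh
output=$(printf '%s\n' "$script" | sh -s staging)
echo "$output"
```

Over ssh the arguments are joined into a single command string before the remote shell parses them, so arguments containing spaces would need extra quoting.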
Just read a new blog using setsid without any further installation/configuration besides the mainstream kernel. Tested/Verified under Ubuntu14.04.
While the author has a very clear explanation and sample code as well, here's the magic part for a quick glance:
#----------------------------------------------------------------------
# Create a temp script to echo the SSH password, used by SSH_ASKPASS
#----------------------------------------------------------------------
SSH_ASKPASS_SCRIPT=/tmp/ssh-askpass-script
cat > ${SSH_ASKPASS_SCRIPT} <<EOL
#!/bin/bash
echo "${PASS}"
EOL
chmod u+x ${SSH_ASKPASS_SCRIPT}
# Tell SSH to read in the output of the provided script as the password.
# We still have to use setsid to eliminate access to a terminal and thus avoid
# it ignoring this and asking for a password.
export SSH_ASKPASS=${SSH_ASKPASS_SCRIPT}
......
......
# Log in to the remote server and run the above command.
# The use of setsid is a part of the machinations to stop ssh
# prompting for a password.
setsid ssh ${SSH_OPTIONS} ${USER}@${SERVER} "ls -rlt"
Easiest way I found without installing or configuring much software is using plain old tmux. Say you have 9 linux servers. Pick a box as your main. Start a tmux session:
tmux
Then create 9 split tmux panes by doing this 8 times:
ctrl-b + %
Now SSH into each box in each pane. You'll need to know some tmux shortcuts. To navigate, press:
ctrl+b <arrow-keys>
Once you're logged in to all your boxes, one per pane, turn on pane synchronization, which lets you type the same thing into each box:
ctrl+b :setw synchronize-panes on
Now when you press any keys, they will show up in every pane. To turn it off, just change on to off. To cycle through pane layouts (resizing the panes), press ctrl+b <space-bar>.
This works a lot better for me since I need to see each terminal's output, as servers sometimes crash or hang for whatever reason when downloading or upgrading software. Any issues can then be isolated and resolved individually.

Resources