Setup Pseudo Distributed / Single Node Setup Apache Hadoop 2.2

I have installed Apache Hadoop 2.2 as a single-node cluster. When I try to run a Giraph example, it fails with the error "LocalJobRunner, you cannot run in split master/worker mode since there is only 1 task at a time".
I went through some forums and found that I could update mapred-site.xml to allow 4 mappers. I tried that, but it did not help. Another forum post said that changing the single-node setup to behave in pseudo-distributed mode resolved the issue.
Can someone please tell me which config files I need to change to make a single-node setup behave in pseudo-distributed mode?

Adding to renZzz's answer: you also need to check that you can ssh to localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The following link may help: https://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-common/SingleNodeSetup.html
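A minimal sketch of the check above as a script, assuming bash and OpenSSH. The function name check_passwordless_ssh and the SSH_BIN override are hypothetical names used for illustration; they are not part of Hadoop. BatchMode=yes makes ssh fail instead of prompting. Note also that OpenSSH 7.0 and later disable DSA keys by default, so on newer systems an RSA key (ssh-keygen -t rsa) is probably the safer choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if "ssh <host>" works without any prompt.
# SSH_BIN is an illustrative override so the function can be tried with a stub.
check_passwordless_ssh() {
  local host=${1:-localhost}
  # BatchMode=yes disables password/passphrase prompts, so ssh exits non-zero
  # instead of hanging when key-based login is not set up.
  if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
    echo "passwordless ssh to $host: OK"
  else
    echo "passwordless ssh to $host: FAILED"
    return 1
  fi
}

# Usage:
#   check_passwordless_ssh localhost
```

If the check fails, generating a key and appending it to ~/.ssh/authorized_keys as shown above should make it pass.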

For my first setup I followed several manuals, but by far the best one for a single-node setup was the PDF Apache Hadoop YARN_sample. I recommend following that manual step by step.

First, ensure that the number of workers is one. Then, you need to configure Giraph not to split workers and master via:
giraph.SplitMasterWorker=false
You can either set it in giraph-site.xml or pass it as a command-line option:
-ca giraph.SplitMasterWorker=false
Ref:
https://www.mail-archive.com/user@giraph.apache.org/msg01631.html
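As a rough sketch, the two settings can be appended to whatever GiraphRunner arguments you already use. The wrapper name run_giraph_single_worker, the GIRAPH_JAR variable (a placeholder for the path to your Giraph examples jar) and the DRY_RUN switch are assumptions made for illustration; only -w, -ca and org.apache.giraph.GiraphRunner come from Giraph itself.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: forces one worker and disables the master/worker split.
# GIRAPH_JAR is a placeholder for the path to your Giraph jar.
# Set DRY_RUN=1 to print the command instead of running it.
run_giraph_single_worker() {
  ${DRY_RUN:+echo} hadoop jar "$GIRAPH_JAR" org.apache.giraph.GiraphRunner \
    "$@" -w 1 -ca giraph.SplitMasterWorker=false
}

# Usage (computation class and remaining options are whatever you already pass):
#   GIRAPH_JAR=/path/to/giraph-examples.jar run_giraph_single_worker <computation class> <other options>
```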

Related

unable to start a job using spark-submit via ssh (on EC2)

I set up spark on a single EC2 machine and, when I am connected to it, I am able to use spark either with jupyter or spark-submit, without any issue. Unfortunately, though, I am not able to use spark-submit via ssh.
So, to recap:
This works:
ubuntu@ip-198-43-52-121:~$ spark-submit job.py
This does not work:
ssh -i file.pem ubuntu@blablablba.compute.amazon.com "spark-submit job.py"
Initially, I kept getting the following error message over and over:
'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
After having read many articles and posts about this issue, I thought that the problem was due to some variables not having been set properly, so I added the following lines to the machine's .bashrc file:
export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7 #(it's where i unzipped the spark file)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=/usr/bin/python3
export PYSPARK_PYTHON=python3
(As the error message referenced python, I also tried adding the line "alias python=python3" to .bashrc, but nothing changed)
After all this, if I try to submit the spark job via ssh I get the following error message:
"command spark-submit not found".
As it looks like the system ignores all the environment variables when sending commands via SSH, I decided to source the machine's .bashrc file before trying to run the spark job. As I was not sure about the most appropriate way to send multiple commands via SSH, I tried all the following ways:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file"
ssh -i file.pem ubuntu@blabla.compute.amazon.com << HERE
source .bashrc
spark-submit job.file
HERE
ssh -i file.pem ubuntu@blabla.compute.amazon.com <<- HERE
source .bashrc
spark-submit job.file
HERE
(ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file")
All attempts worked with other commands like ls or mkdir, but not with source and spark-submit.
I have also tried providing the full path running the following line:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"
In this case too I get, once again, the following message:
'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
How can I tell spark which python to use if SSH seems to ignore all environment variables, no matter how many times I set them?
It's worth mentioning that I got into coding and data a bit more than a year ago, so I am really a newbie here and any help would be highly appreciated. The solution may be very simple, but I cannot get my head around it. Please help.
Thanks a lot in advance :)
The problem was indeed with the way I was expecting the shell to work (which was wrong).
My issue was solved by:
Setting my variables in .profile instead of .bashrc
Providing full path to python
Now I can launch spark jobs via ssh.
I found the solution in the answer @VinkoVrsalovic gave to this post:
Why does an SSH remote command get fewer environment variables then when run manually?
Cheers
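One way to avoid depending on which startup files a non-interactive SSH session reads is to put everything on the remote command line. The sketch below is only an illustration: remote_spark_cmd is a hypothetical helper, and the paths in the usage line are the ones from the question. PYSPARK_PYTHON is the Spark environment variable that selects the Python interpreter. A likely reason the .bashrc exports had no effect is that the stock Ubuntu ~/.bashrc commonly returns early when the shell is not interactive, so lines added at the end are never reached.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the command string to run on the remote host.
# The interpreter and spark-submit are given as full paths and the variable is
# set inline, so nothing depends on .bashrc or .profile being read.
remote_spark_cmd() {
  local spark_home=$1 python_bin=$2 job=$3
  printf 'PYSPARK_PYTHON=%s %s/bin/spark-submit %s\n' \
    "$python_bin" "$spark_home" "$job"
}

# Usage (host name and key file are placeholders):
#   ssh -i file.pem ubuntu@host \
#     "$(remote_spark_cmd /home/ubuntu/spark-3.0.1-bin-hadoop2.7 /usr/bin/python3 job.py)"
```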

SSH key setup for Hadoop connections in multi-node clusters

I know that passwordless SSH key connections are required to operate Hadoop.
Suppose a cluster has five nodes: one namenode and four datanodes.
By setting up SSH keys, we can connect from the namenode to the datanodes and vice versa.
As far as I know, Hadoop requires the connection to work in both directions; one-way access (namenode to datanode, but not datanode to namenode) is not enough.
In this scenario, with 50 or 100 nodes it is very laborious to configure all the SSH keys by logging in to each machine and typing the same ssh-keygen -t ... commands.
For this reason, I tried to script it in shell but failed to make it work automatically.
My code is below.
list.txt
namenode1
datanode1
datanode2
datanode3
datanode4
datanode5
...
cat list.txt | while read server
do
ssh $server 'ssh-keygen' < /dev/null
while read otherserver
do
ssh $server 'ssh-copy-id $otherserver' < /dev/null
done
done
However, it didn't work. The code is meant to iterate over all the nodes, create a key on each one, and then copy the generated key to the other servers using ssh-copy-id.
So my question is: how can I write a shell script that enables SSH connections in both directions? I have spent a lot of time on this and cannot find any documentation describing how to set up SSH across many nodes without these laborious manual steps.
You only need to create a public/private key pair on the master node, then use ssh-copy-id -i ~/.ssh/id_rsa.pub $server in the loop. The master itself should be included in the loop. There is no need to do this in reverse on the other nodes. The keys have to belong to, and be installed by, the user that runs the Hadoop cluster. After running the script, you should be able to ssh to all nodes, as the hadoop user, without using a password.
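A minimal sketch of that loop, assuming bash and a host list in the same format as list.txt from the question. The function name distribute_key and the COPY_CMD override are hypothetical and exist only to make the example easy to try. The host list is read on file descriptor 3, which avoids a common pitfall where a command inside the loop consumes the rest of the list from stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run on the master as the user that runs Hadoop.
# COPY_CMD is an illustrative override (defaults to ssh-copy-id) for dry runs.
distribute_key() {
  local list_file=$1 pubkey=${2:-$HOME/.ssh/id_rsa.pub} server
  # Read hosts on fd 3 so ssh-copy-id cannot swallow the list from stdin.
  # The "|| [ -n ... ]" part handles a last line without a trailing newline.
  while IFS= read -r server <&3 || [ -n "$server" ]; do
    [ -n "$server" ] || continue   # skip blank lines
    "${COPY_CMD:-ssh-copy-id}" -i "$pubkey" "$server"
  done 3< "$list_file"
}

# Usage: create the key pair once on the master (ssh-keygen -t rsa), then:
#   distribute_key list.txt
```

Each host will still ask for its password once while the key is being installed; after that, logins from the master should no longer prompt.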

Hadoop : start-dfs.sh Connection refused

I have a Vagrant box running debian/stretch64.
I am trying to install Hadoop 3 following the documentation:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.htm
When I run start-dfs.sh
I get this message:
vagrant@stretch:/opt/hadoop$ sudo sbin/start-dfs.sh
Starting namenodes on [localhost]
pdsh@stretch: localhost: connect: Connection refused
Starting datanodes
pdsh@stretch: localhost: connect: Connection refused
Starting secondary namenodes [stretch]
pdsh@stretch: stretch: connect: Connection refused
vagrant@stretch:/opt/hadoop$
Of course, I tried updating my hadoop-env.sh with:
export HADOOP_SSH_OPTS="-p 22"
ssh localhost works (without a password).
I have no idea what else to change to solve this problem.
There is a problem with the way pdsh works by default (see edit), but Hadoop can run without it. Hadoop checks whether the system has pdsh at /usr/bin/pdsh and uses it if so. An easy way to avoid pdsh is to edit $HADOOP_HOME/libexec/hadoop-functions.sh:
replace the line
if [[ -e '/usr/bin/pdsh' ]]; then
by
if [[ ! -e '/usr/bin/pdsh' ]]; then
Hadoop then runs without pdsh and everything works.
EDIT:
A better solution is to keep pdsh but make it use ssh instead of rsh, as explained here. To do so, replace this line in $HADOOP_HOME/libexec/hadoop-functions.sh:
PDSH_SSH_ARGS_APPEND="${HADOOP_SSH_OPTS}" pdsh \
by
PDSH_RCMD_TYPE=ssh PDSH_SSH_ARGS_APPEND="${HADOOP_SSH_OPTS}" pdsh \
Note: only running export PDSH_RCMD_TYPE=ssh, as I mention in the comment, doesn't work. I don't know why...
I've also opened an issue and submitted a patch for this problem: HADOOP-15219
I fixed this problem for hadoop 3.1.0 by adding
PDSH_RCMD_TYPE=ssh
in my .bashrc as well as $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
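A small sketch of how that setting could be added to both files without duplicating it on repeated runs. The helper name ensure_line is hypothetical, and writing the setting with export is an assumption on my part (the answer above shows only the bare assignment).

```shell
#!/usr/bin/env bash
# Hypothetical helper: appends a line to a file only if that exact line
# is not already present, so it is safe to run more than once.
ensure_line() {
  local line=$1 file=$2
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (file locations as named in the answer above):
#   ensure_line 'export PDSH_RCMD_TYPE=ssh' "$HOME/.bashrc"
#   ensure_line 'export PDSH_RCMD_TYPE=ssh' "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"
```

After editing .bashrc, log out and back in (or start a new shell) before running start-dfs.sh again.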
Check whether your /etc/hosts file contains mappings for both the hostname stretch and localhost.
my /etc/hosts file
Go to your hadoop home directory
~$ cd libexec
~$ nano hadoop-functions.sh
edit this line:
if [[ -e '/usr/bin/pdsh' ]]; then
with:
if [[ ! -e '/usr/bin/pdsh' ]]; then
Additionally, it is recommended that pdsh also be installed for better ssh resource management. —— Hadoop: Setting up a Single Node Cluster
We can remove pdsh to solve this problem.
apt-get remove pdsh
Check whether a firewall is running on your Vagrant box:
chkconfig iptables off
/etc/init.d/iptables stop
If that is not the cause, have a look at the underlying logs in /var/log/...
I was dealing with my colleague's problem.
He had configured ssh using the hostname from the hosts file but specified the IP address in the workers file.
After I rewrote the workers file, everything worked.
~/hosts file
10.0.0.1 slave01
#ssh-copy-id hadoop@slave01
~/hadoop/etc/workers
slave01
I added export PDSH_RCMD_TYPE=ssh to my .bashrc file, logged out and back in and it worked.
For some reason simply exporting and running right away did not work for me.

Hadoop alternate SSH key

I'm setting up a multinode hadoop cluster and have a shared key to passwordless SSH between nodes. I named the file ~/.ssh/hadoop_rsa and can connect to other hosts using ssh -i ~/.ssh/hadoop_rsa host.
I need some way to tell hadoop to use this alternate SSH key when connecting to other nodes.
It appears that commands are run on each slave using the script:
$HADOOP_HOME/sbin/slaves.sh
That script includes a reference to the environment variable $HADOOP_SSH_OPTS when calling ssh. I was able to tell Hadoop to use a different key file by setting an environment variable like this:
export HADOOP_SSH_OPTS="-i ~/.ssh/hadoop_rsa"
Thanks to Varun on the Hadoop mailing list for pointing me in the right direction

How can I automate running commands remotely over SSH to multiple servers in parallel?

I've searched around a bit for similar questions, but found nothing beyond running one command, or perhaps a few commands, with something like:
ssh user@host -t sudo su -
However, what if I essentially need to run a script on (let's say) 15 servers at once? Is this doable in bash? In a perfect world I need to avoid installing applications if at all possible to pull this off. For argument's sake, let's just say that I need to do the following across 10 hosts:
Deploy a new Tomcat container
Deploy an application in the container, and configure it
Configure an Apache vhost
Reload Apache
I have a script that does all of that, but it relies on me logging into all the servers, pulling a script down from a repo, and then running it. If this isn't doable in bash, what alternatives do you suggest? Do I need a bigger hammer, such as Perl (Python might be preferred since I can guarantee Python is on all boxes in a RHEL environment thanks to yum/up2date)? If anyone can point me to any useful information it'd be greatly appreciated, especially if it's doable in bash. I'll settle for Perl or Python, but I just don't know those as well (working on that). Thanks!
You can run a local script as shown by che and Yang, and/or you can use a Here document:
ssh root@server /bin/sh <<\EOF
wget http://server/warfile # Could use NFS here
cp app.war /location
command 1
command 2
/etc/init.d/httpd restart
EOF
Often, I'll just use the original Tcl version of Expect. You only need to have that on the local machine. If I'm inside a program using Perl, I do this with Net::SSH::Expect. Other languages have similar "expect" tools.
The issue of how to run commands on many servers at once came up on a Perl mailing list the other day and I'll give the same recommendation I gave there, which is to use gsh:
http://outflux.net/unix/software/gsh
gsh is similar to the "for box in box1_name box2_name box3_name" solution already given but I find gsh to be more convenient. You set up a /etc/ghosts file containing your servers in groups such as web, db, RHEL4, x86_64, or whatever (man ghosts) then you use that group when you call gsh.
[pdurbin@beamish ~]$ gsh web "cat /etc/redhat-release; uname -r"
www-2.foo.com: Red Hat Enterprise Linux AS release 4 (Nahant Update 7)
www-2.foo.com: 2.6.9-78.0.1.ELsmp
www-3.foo.com: Red Hat Enterprise Linux AS release 4 (Nahant Update 7)
www-3.foo.com: 2.6.9-78.0.1.ELsmp
www-4.foo.com: Red Hat Enterprise Linux Server release 5.2 (Tikanga)
www-4.foo.com: 2.6.18-92.1.13.el5
www-5.foo.com: Red Hat Enterprise Linux Server release 5.2 (Tikanga)
www-5.foo.com: 2.6.18-92.1.13.el5
[pdurbin@beamish ~]$
You can also combine or split ghost groups, using web+db or web-RHEL4, for example.
I'll also mention that while I have never used shmux, its website contains a list of software (including gsh) that lets you run commands on many servers at once. Capistrano has already been mentioned and (from what I understand) could be on that list as well.
Take a look at Expect (man expect)
I've accomplished similar tasks in the past using Expect.
You can pipe the local script to the remote server and execute it with one command:
ssh -t user@host 'sh' < path_to_script
This can be further automated by using public key authentication and wrapping with scripts to perform parallel execution.
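A rough sketch of such a wrapper, assuming bash and key-based authentication already in place. The function name run_script_everywhere and the SSH_BIN override are hypothetical. Each ssh is started as a background job, and wait blocks until all of them have finished. The -t flag is left out here because stdin is a file rather than a terminal.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pipes one local script to many hosts concurrently.
# SSH_BIN is an illustrative override (defaults to ssh) for dry runs.
run_script_everywhere() {
  local script=$1; shift
  local host
  for host in "$@"; do
    # Each ssh runs in the background; sed prefixes every output line
    # with the host name so interleaved output stays readable.
    "${SSH_BIN:-ssh}" "$host" 'sh' < "$script" 2>&1 | sed "s/^/$host: /" &
  done
  wait   # block until every background job has finished
}

# Usage:
#   run_script_everywhere path_to_script host1 host2 host3
```

Output lines from different hosts may arrive in any order, which is why each line carries its host name.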
You can try paramiko. It's a pure-python ssh client. You can program your ssh sessions. Nothing to install on remote machines.
See this great article on how to use it.
To give you the structure, without actual code.
Use scp to copy your install/setup script to the target box.
Use ssh to invoke your script on the remote box.
pssh may be interesting since, unlike most solutions mentioned here, the commands are run in parallel.
(For my own use, I wrote a simpler small script very similar to GavinCattell's one, it is documented here - in french).
Have you looked at things like Puppet or Cfengine? They can do what you want and probably much more.
For those that stumble across this question, I'll include an answer that uses Fabric, which solves exactly the problem described above: Running arbitrary commands on multiple hosts over ssh.
Once fabric is installed, you'd create a fabfile.py, and implement tasks that can be run on your remote hosts. For example, a task to Reload Apache might look like this:
from fabric.api import env, run
env.hosts = ['host1@example.com', 'host2@example.com']
def reload():
    """ Reload Apache """
    run("sudo /etc/init.d/apache2 reload")
Then, on your local machine, run fab reload and the sudo /etc/init.d/apache2 reload command would get run on all the hosts specified in env.hosts.
You can do it the same way you did before, just script it instead of doing it manually. The following code remotes to machine named 'loca' and runs two commands there. What you need to do is simply insert commands you want to run there.
che@ovecka ~ $ ssh loca 'uname -a; echo something_else'
Linux loca 2.6.25.9 #1 (blahblahblah)
something_else
Then, to iterate through all the machines, do something like:
for box in box1_name box2_name box3_name
do
    ssh "$box" 'commands_to_run_everywhere'
done
In order to make this ssh thing work without entering passwords all the time, you'll need to set up key authentication. You can read about it at IBM developerworks.
You can run the same command on several servers at once with a tool like cluster ssh. The link is to a discussion of cluster ssh on the Debian package of the day blog.
Well, for steps 1 and 2, isn't there a Tomcat manager web interface? You could script that with curl, or with zsh and the libwww plugin.
For SSH you're looking to:
1) not get prompted for a password (use keys)
2) pass the command(s) on SSH's commandline, this is similar to rsh in a trusted network.
Other posts have shown you what to do, and I'd probably use sh too, but I'd be tempted to use perl, like ssh tomcatuser@server perl -e 'do-everything-on-one-line;', or you could do this:
either scp the_package.tbz tomcatuser@server:the_place/.
ssh tomcatuser@server /bin/sh <<\EOF
define stuff like TOMCAT_WEBAPPS=/usr/local/share/tomcat/webapps
tar xj the_package.tbz or rsync rsync://repository/the_package_place
mv $TOMCAT_WEBAPPS/old_war $TOMCAT_WEBAPPS/old_war.old
mv $THE_PLACE/new_war $TOMCAT_WEBAPPS/new_war
touch $TOMCAT_WEBAPPS/new_war [you don't normally have to restart tomcat]
mv $THE_PLACE/vhost_file $APACHE_VHOST_DIR/vhost_file
$APACHECTL restart [might need to login as apache user to move that file and restart]
EOF
You want DSH or distributed shell, which is used in clusters a lot. Here is the link: dsh
You basically have node groups (a file with lists of nodes in them) and you specify which node group you wish to run commands on then you would use dsh, like you would ssh to run commands on them.
dsh -a /path/to/some/command/or/script
It will run the command on all the machines at the same time and return the output prefixed with the hostname. The command or script has to be present on the system, so a shared NFS directory can be useful for these sorts of things.
Creates hostname ssh command of all machines accessed.
by Quierati
http://pastebin.com/pddEQWq2
#Use in .bashrc
#Use "HashKnownHosts no" in ~/.ssh/config or /etc/ssh/ssh_config
# If known_hosts is encrypted and delete known_hosts
[ ! -d ~/bin ] && mkdir ~/bin
for host in `cut -d, -f1 ~/.ssh/known_hosts|cut -f1 -d " "`;
do
[ ! -s ~/bin/$host ] && echo ssh $host '$*' > ~/bin/$host
done
[ -d ~/bin ] && chmod -R 700 ~/bin
export PATH=$PATH:~/bin
Example:
$for i in hostname{1..10}; do $i who;done
There is a tool called FLATT (FLexible Automation and Troubleshooting Tool) that allows you to execute scripts on multiple Unix/Linux hosts with a click of a button. It is a desktop GUI app that runs on Mac and Windows but there is also a command line java client.
You can create batch jobs and reuse on multiple hosts.
Requires Java 1.6 or higher.
Although it's a complex topic, I can highly recommend Capistrano.
I'm not sure if this method will work for everything that you want, but you can try something like this:
$ cat your_script.sh | ssh your_host bash
Which will run the script (which resides locally) on the remote server.
Just read a new blog using setsid without any further installation/configuration besides the mainstream kernel. Tested/Verified under Ubuntu14.04.
While the author has a very clear explanation and sample code as well, here's the magic part for a quick glance:
#----------------------------------------------------------------------
# Create a temp script to echo the SSH password, used by SSH_ASKPASS
#----------------------------------------------------------------------
SSH_ASKPASS_SCRIPT=/tmp/ssh-askpass-script
cat > ${SSH_ASKPASS_SCRIPT} <<EOL
#!/bin/bash
echo "${PASS}"
EOL
chmod u+x ${SSH_ASKPASS_SCRIPT}
# Tell SSH to read in the output of the provided script as the password.
# We still have to use setsid to eliminate access to a terminal and thus avoid
# it ignoring this and asking for a password.
export SSH_ASKPASS=${SSH_ASKPASS_SCRIPT}
......
......
# Log in to the remote server and run the above command.
# The use of setsid is a part of the machinations to stop ssh
# prompting for a password.
setsid ssh ${SSH_OPTIONS} ${USER}@${SERVER} "ls -rlt"
Easiest way I found without installing or configuring much software is using plain old tmux. Say you have 9 linux servers. Pick a box as your main. Start a tmux session:
tmux
Then create 9 split tmux panes by doing this 8 times:
ctrl-b + %
Now SSH into each box in each pane. You'll need to know some tmux shortcuts. To navigate, press:
ctrl+b <arrow-keys>
Once you're logged in to all your boxes, one per pane, turn on pane synchronization, which lets you type the same thing into every box:
ctrl+b :setw synchronize-panes on
Now when you press any key, it shows up in every pane. To turn it off, just change on to off. To cycle through pane layouts, press ctrl+b <space-bar>.
This works a lot better for me since I need to see each terminal's output, as servers sometimes crash or hang for whatever reason when downloading or upgrading software. If there are any issues, you can isolate and resolve them individually.
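The manual steps above can also be scripted. This is only a sketch, assuming bash and a reasonably recent tmux: tmux_ssh_all and the TMUX_BIN override are hypothetical names, while the tmux subcommands (new-session, split-window, select-layout, set-window-option, attach-session) are standard ones.

```shell
#!/usr/bin/env bash
# Hypothetical helper: opens one tmux pane per host, each running ssh,
# then turns on synchronized input and attaches to the session.
# TMUX_BIN is an illustrative override (defaults to tmux) for dry runs.
tmux_ssh_all() {
  local session=$1; shift
  local tm=${TMUX_BIN:-tmux} host first=1
  for host in "$@"; do
    if [ "$first" -eq 1 ]; then
      "$tm" new-session -d -s "$session" "ssh $host"
      first=0
    else
      "$tm" split-window -t "$session" "ssh $host"
      "$tm" select-layout -t "$session" tiled   # keep panes evenly sized
    fi
  done
  "$tm" set-window-option -t "$session" synchronize-panes on
  "$tm" attach-session -t "$session"
}

# Usage:
#   tmux_ssh_all servers box1 box2 box3
```

A pane closes when its ssh session ends, so a box that drops the connection simply disappears from the layout.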
