Hadoop cluster configuration with Ubuntu Master and Windows slave

Hi, I am new to Hadoop.
Hadoop version: 2.2.0
Goals:
Setup Hadoop standalone - Ubuntu 12 (Completed)
Setup Hadoop standalone - Windows 7 (Cygwin used only for sshd) (Completed)
Setup cluster with Ubuntu Master and Windows 7 slave (This is mostly for learning purposes and setting up an environment for development) (Stuck)
Setup in relationship with the questions below:
Master running on Ubuntu with Hadoop 2.2.0
Slave running on Windows 7 with a self-compiled build of Hadoop 2.2.0 from source. I am using Cygwin only for sshd.
Passwordless login is set up and I am able to log in both ways using ssh from outside Hadoop. Since my Ubuntu and Windows machines have different usernames, I have set up a config file in the .ssh folder which maps hosts to users.
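For reference, that mapping in my ~/.ssh/config looks roughly like this (the hostnames, addresses and usernames below are placeholders rather than my real ones):
# maps each host alias to the account name used on that machine
Host ubuntu-master
    HostName 192.168.56.10
    User masteruser
Host windows-slave
    HostName 192.168.56.20
    User slaveuser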
Questions:
In a cluster, does the username on the master need to be the same as on the slave? The reason I am asking is that after configuring the cluster, when I try to use start-dfs.sh the logs say that it was able to ssh into the slave nodes but could not find the location "/home/xxx/hadoop/bin/hadoop-daemon.sh" on the slave. The "xxx" is my master username, not the slave one. Also, since my slave is a pure Windows build, the install is under C:/hadoop/... Does the master look at the env variable $HADOOP_HOME to check where the install is on the slave? Are there any other env variables that I need to set?
My goal was to use the Windows Hadoop build on the slave, since Hadoop is officially supporting Windows now. But is it better to run the Linux build under Cygwin to accomplish this? The question comes up because I am seeing that start-dfs.sh tries to execute hadoop-daemon.sh and not some *.cmd.
If this setup works out, a follow-up question is whether Pig, Mahout, etc. will run in this kind of setup, as I have not seen builds of Pig or Mahout for Windows. Do these components need to be present only on the master node, or do they need to be on the slave nodes too? I saw two ways of running Mahout when experimenting with standalone mode: first, using the mahout script, which I was able to use on Linux, and second, using the yarn jar command, where I passed in the Mahout jar while using the Windows version. If Mahout/Pig (when using the provided sh scripts) assume that the slaves already have the jars in place, then the Ubuntu + Windows combo does not seem to work. Please advise.
As I mentioned, this is more an experiment than an implementation plan. Our final environment will be completely on Linux. Thank you for your suggestions.

You may have more success going with more standard ways of deploying Hadoop. Try using Ubuntu VMs for the master and slaves.
You can also try a pseudo-distributed deployment in which all of the processes run on a single VM, which avoids having to even consider multiple OSes.

I have only worked with the same username. In general, SSH lets you log in with a different login name via the -l option, but this might get tricky. You have to list your slaves in the slaves file.
At least in the manual https://hadoop.apache.org/docs/r0.19.1/cluster_setup.html#Slaves I did not find anything about adding usernames. It might be worth trying to append -l login_name to the slave node's entry in the slaves conf file and seeing if it works.
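To make that concrete, the slaves file normally just lists one hostname per line; the untested idea would be to put the login name on the entry as well (win-slave and winuser are placeholder names, and I have not verified that the Hadoop scripts accept this form):
# etc/hadoop/slaves normally contains one hostname per line, e.g.
win-slave
# the untested suggestion above would turn that entry into:
win-slave -l winuser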

Related

What are best practices to run a command on all nodes in an HDP cluster?

Often in a Hadoop environment, you are required to run a command or a script, or copy a file, to all nodes in the cluster.
What are efficient ways of doing that (without having to ssh to each node separately)?
Example:
When upgrading Ambari, you are required to run many commands on all nodes where a certain component is installed - e.g. Infra, SmartSense, etc.
I use Ansible to do that; it will do the job for you.
But you can also use Puppet, Salt, or Chef.
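For example, with Ansible and an inventory file listing the cluster nodes (the file name hosts.ini and the sample commands below are placeholders), ad-hoc runs look something like this:
# run a command on every node in the inventory
ansible all -i hosts.ini -m shell -a "yum -y upgrade ambari-agent"
# copy a script to every node and make it executable
ansible all -i hosts.ini -m copy -a "src=./fix.sh dest=/tmp/fix.sh mode=0755"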

Starting Hadoop Services using Command Line (CDH 5)

I know how to start services using the Cloudera Manager interface, but I prefer to know what is really happening behind the scenes and not rely on "magic".
I read this page but it does not give the desired information.
I know there are some .sh files to be used, but they seem to vary from version to version, and I'm using the latest as of today (5.3).
I would be grateful to have a list of service-starting commands (specifically for HDFS).
PS: It looks like Cloudera somehow ditched the classic Apache scripts (start-dfs.sh etc.)
You can figure this out by installing Cloudera's optional service packages.
These use the service command to start services instead of Cloudera Manager.
hadoop-hdfs-namenode - for namenode
hadoop-hdfs-secondarynamenode - for secondary namenode
hadoop-hdfs-datanode - for datanode
hadoop-hdfs-journalnode - for journalnode
You can see the CDH5.9 RPMs here:
http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.9/RPMS/x86_64/
After you install them, you can look at the respective /etc/init.d/SERVICENAME
to understand how they are run (assuming you're comfortable looking at shell scripts).
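For instance, once those packages are installed, HDFS can be started by hand with the service command, something like this (a sketch; run each command, as root or via sudo, on the host that actually carries that role):
sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-secondarynamenode start
sudo service hadoop-hdfs-datanode start
# the same init scripts also support status and stop, e.g.
sudo service hadoop-hdfs-datanode status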

How could I install Hortonworks' HDP?

I am new to this and I want to know how I can install the solution provided by Hortonworks, HDP (http://hortonworks.com/products/data-center/hdp/), given the following setup: I have 2 virtual machines plus another local machine to work with, and I want to use the 2 VMs as the master node and worker node once I configure Apache Spark.
But my question is: what do I have to do to install HDP correctly? Do I have to install the solution on my local machine and configure Apache Spark to use those 2 virtual machines as the master node and worker node? Or must I install HDP on all 3 machines that I have?
I repeat that I am new to this, and any answer or comment you could give would be very helpful for me.
Thank you so much!
If you are trying to deploy HDP on a cluster (multi-node environment),
use Apache Ambari to install HDP and other services such as Spark.
I have tried it on a CentOS environment; below are the steps and links for installation.
Get the repo by using the command below
wget http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.2.2.0/ambari.repo
Install Ambari by using the command below
yum install ambari-server
Start the server by using the command below
ambari-server start
Now you can set up your cluster by going to a web browser and typing http://<ambari-server-host>:8080.
Now you can select the desired HDP version. Add the other nodes and services and deploy your cluster.
Detailed steps for installation can be found here
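Putting the steps above together, the install on the Ambari host looks roughly like this (a sketch; note that the repo file needs to end up under /etc/yum.repos.d for yum to find it, and ambari-server setup is normally run once before starting the server):
wget -O /etc/yum.repos.d/ambari.repo http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.2.2.0/ambari.repo
yum install -y ambari-server
ambari-server setup    # interactive: configures the JDK and the database
ambari-server start
# then open http://<ambari-server-host>:8080 and run the cluster install wizard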

Spark: how to set worker-specific SPARK_HOME in standalone mode [duplicate]

This question already has answers here:
How to use start-all.sh to start standalone Worker that uses different SPARK_HOME (than Master)?
I'm setting up a [somewhat ad-hoc] cluster of Spark workers: namely, a couple of lab machines that I have sitting around. However, I've run into a problem when I attempt to start the cluster with start-all.sh: namely, Spark is installed in different directories on the various workers. But the master invokes $SPARK_HOME/sbin/start-all.sh on each one using the master's definition of $SPARK_HOME, even though the path is different for each worker.
Assuming I can't install Spark on identical paths on each worker to the master, how can I get the master to recognize the different worker paths?
EDIT #1 Hmm, found this thread in the Spark mailing list, strongly suggesting that this is the current implementation--assuming $SPARK_HOME is the same for all workers.
I'm playing around with Spark on Windows (my laptop) and have two worker nodes running by starting them manually using a script that contains the following:
set SPARK_HOME=C:\dev\programs\spark-1.2.0-worker1
set SPARK_MASTER_IP=master.brad.com
spark-class org.apache.spark.deploy.worker.Worker spark://master.brad.com:7077
I then create a copy of this script with a different SPARK_HOME defined to run my second worker from. When I kick off a spark-submit I see this on Worker_1
15/02/13 16:42:10 INFO ExecutorRunner: Launch command: ...C:\dev\programs\spark-1.2.0-worker1\bin...
and this on Worker_2
15/02/13 16:42:10 INFO ExecutorRunner: Launch command: ...C:\dev\programs\spark-1.2.0-worker2\bin...
So it works, and in my case I duplicated the Spark installation directory, but you may be able to get around this.
You might want to consider setting this by changing the SPARK_WORKER_DIR line in the spark-env.sh file.
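For reference, that variable lives in conf/spark-env.sh on each worker and points at the worker's scratch/work directory (the path below is only an example):
# conf/spark-env.sh on a given worker
export SPARK_WORKER_DIR=/data/spark/work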
A similar question was asked here
The solution I used was to create a symbolic link mimicking the master node's installation path on each worker node, so that when start-all.sh runs on the master node and SSHes into a worker node, it sees an identical path and can run the worker scripts.
For example, in my case I had 2 Macs and 1 Linux machine. Both Macs had Spark installed under /Users/<user>/spark, whereas the Linux machine had it under /home/<user>/spark. One of the Macs was the master node, so running start-all.sh would error out each time on the Linux machine because of the path mismatch (error: /Users/<user>/spark does not exist).
The simple solution was to mimic the Mac's pathing on the Linux machine using a symbolic link:
open terminal
cd / <-- go to the root of the drive
sudo ln -s home Users <-- create a sym link "Users" pointing to the actual "home" directory.

Not able to run Hadoop job remotely

I want to run a Hadoop job remotely from a Windows machine. The cluster is running on Ubuntu.
Basically, I want to do two things:
Execute the hadoop job remotely.
Retrieve the result from hadoop output directory.
I don't have any idea how to achieve this. I am using Hadoop version 1.1.2.
I tried passing the jobtracker/namenode URL in the Job configuration, but it fails.
I have tried the following example : Running java hadoop job on local/remote cluster
Result: I consistently get a "cannot load directory" error. It is similar to this post:
Exception while submitting a mapreduce job from remote system
Welcome to a world of pain. I've just implemented this exact use case, but using Hadoop 2.2 (the current stable release) patched and compiled from source.
What I did, in a nutshell, was:
Download the Hadoop 2.2 sources tarball to a Linux machine and decompress it to a temp dir.
Apply these patches which solve the problem of connecting from a Windows client to a Linux server.
Build it from source, using these instructions. It will also ensure that you have 64-bit native libs if you have a 64-bit Linux server. Make sure you fix the build files as the post instructs or the build would fail. Note that after installing protobuf 2.5, you have to run sudo ldconfig, see this post.
Deploy the resulting dist tar from hadoop-2.2.0-src/hadoop-dist/target on the server node(s) and configure it. I can't help you with that since you need to tweak it to your cluster topology.
Install Java on your client Windows machine. Make sure that the path to it has no spaces in it, e.g. c:\java\jdk1.7.
Deploy the same Hadoop dist tar you built on your Windows client. It will contain the crucial fix for the Windows/Linux connection problem.
Compile winutils and Windows native libraries as described in this Stackoverflow answer. It's simpler than building entire Hadoop on Windows.
Set up the JAVA_HOME, HADOOP_HOME and PATH environment variables as described in these instructions.
Use a text editor or unix2dos (from Cygwin or standalone) to convert all .cmd files in the bin and etc\hadoop directories, otherwise you'll get weird errors about labels when running them.
Configure the connection properties to your cluster in your config XML files, namely fs.default.name, mapreduce.jobtracker.address, yarn.resourcemanager.hostname and the like.
Add the rest of the configuration required by the patches from item 2. This is required for the client side only. Otherwise the patch won't work.
If you've managed all of that, you can start your Linux Hadoop cluster and connect to it from your Windows command prompt. Joy!
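Once that works, submitting a job and pulling back its output from the Windows command prompt looks something like this (the jar name, class name and paths here are placeholders):
hadoop jar wordcount.jar org.example.WordCount /user/me/input /user/me/output
hadoop fs -cat /user/me/output/part-r-00000
hadoop fs -get /user/me/output C:\results\output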
