How do I add a node to a pseudo-distributed hadoop setup? - hadoop

I have a single-node, pseudo-distributed Hadoop setup on a Unix system in the network. What are the minimum steps to add another computer/node (running Cygwin) on the network to form a Hadoop cluster?

Instructions for a Hadoop single-node cluster:
http://www.michael-noll.com/blog/2007/08/05/running-hadoop-on-ubuntu/
Instructions for a Hadoop multi-node cluster:
http://www.michael-noll.com/blog/2007/08/09/running-hadoop-on-ubuntu-part-2-multi-node-cluster/
The author, Michael Noll, makes installation and configuration very easy and has been keeping the instructions up to date.
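Roughly, the multi-node tutorial boils down to steps like the following (a sketch assuming a Hadoop 1.x layout with hostnames "master" and "slave"; the Cygwin node will additionally need a working SSH server, and the tutorial covers the details):

# on the master, list every machine that should run DataNode/TaskTracker daemons
cat > $HADOOP_HOME/conf/slaves <<'EOF'
master
slave
EOF
# make sure fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml)
# point at the master's hostname on every node, then start the cluster from the master:
$HADOOP_HOME/bin/start-dfs.sh
$HADOOP_HOME/bin/start-mapred.sh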

Related

How to set up Spark on multi-node Hadoop cluster?

I would like to install Hadoop HDFS and Spark on a multi-node cluster.
I was able to successfully install and configure Hadoop on the multi-node cluster. I have also installed and configured Spark on the master node.
Do I have to configure Spark on the slave nodes as well?
You do not. You're done. You have already done more than you needed to in order to submit Spark applications to Hadoop YARN (which I conclude is your cluster manager).
Spark is a library for distributed computations on massive datasets and as such it belongs solely to your Spark applications (not any cluster you may use).
Time to spark-submit Spark applications!
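For example, submitting from the master node where Spark is installed looks roughly like this (a sketch assuming YARN as the cluster manager and the bundled SparkPi example; the paths are placeholders):

export HADOOP_CONF_DIR=/etc/hadoop/conf        # assumption: wherever the cluster's config files live
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100   # the examples jar location varies by Spark version

Because the application and its dependencies are shipped to YARN at submit time, nothing Spark-specific has to be preinstalled on the slaves.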

I have to create a Hadoop multi-node setup using Hadoop 2.5.1. Is there any document available for the same?

I have 4 nodes in my cluster. One will be the master node, one will be the secondary master node, and two will be slaves. All these nodes currently have a single-node setup running. Is there any document available for the multi-node setup?
Are you using the Apache Hadoop version, or a distribution such as Cloudera or Hortonworks?
For the Apache Hadoop setup, refer to this:
http://hadoop.apache.org/docs/r0.18.3/cluster_setup.pdf
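That PDF covers the old 0.18 line; for Hadoop 2.5.1 the moving parts are similar. A rough sketch, assuming hostnames slave1 and slave2 and that core-site.xml, hdfs-site.xml and yarn-site.xml on every node already point at the master:

# on the master, list the worker hostnames in etc/hadoop/slaves
cat > $HADOOP_HOME/etc/hadoop/slaves <<'EOF'
slave1
slave2
EOF
# format HDFS once on the master, then start HDFS and YARN from there
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh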

Deploy Mahout jobs on a cluster

I'm new to Hadoop/Mahout. I understand the concepts, but I'm having issues deploying Mahout jobs to an already set-up cluster of computers.
I have used Mahout on a single computer, but what should I do to get it up and running on an already formed Hadoop cluster?
I have a cluster with Hadoop 0.20.2 installed, and Mahout 0.9, which contains Hadoop 1.2.1. What jars should I copy so that I can run code that contains Mahout calls, or what else should I do to make it work on the Hadoop cluster?
Any suggestion/example/tutorial would be great.
Thanks
An important link for your problem:
https://mahout.apache.org/users/clustering/k-means-commandline.html
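That page documents the k-means driver that ships with Mahout. Once the mahout script is available on a node that has the cluster's Hadoop configuration, a typical invocation looks roughly like this (a sketch with made-up HDFS paths; the input must already be Mahout vectors):

$MAHOUT_HOME/bin/mahout kmeans \
  -i /user/me/vectors \
  -c /user/me/initial-clusters \
  -o /user/me/kmeans-output \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -x 10 -k 5 -ow -cl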

Hadoop Client Node Configuration

Assume that there is a Hadoop cluster that has 20 machines. Out of those 20 machines, 18 are slaves, machine 19 is for the NameNode, and machine 20 is for the JobTracker.
Now I know that the Hadoop software has to be installed on all those 20 machines.
But my question is: which machine is used to load a file xyz.txt into the Hadoop cluster? Is that client machine a separate machine? Do we need to install the Hadoop software on that client machine as well? How does the client machine identify the Hadoop cluster?
I am new to Hadoop, so this is what I understood:
If your data upload is not an actual service of the cluster, which should be running on an edge node of the cluster, then you can configure your own computer to work as an edge node.
An edge node doesn't need to be known by the cluster (except for security purposes), as it neither stores data nor runs compute jobs. This is basically what it means to be an edge node: it is connected to the Hadoop cluster but does not participate in it.
In case it can help someone, here is what I have done to connect to a cluster that I don't administer:
get an account on the cluster, say myaccount
create an account on your computer with the same name: myaccount
configure your computer to access the cluster machines (SSH without a passphrase, registered IP, ...)
get the hadoop configuration files from an edge-node of the cluster
get a Hadoop distribution (e.g. from the Apache download page)
uncompress it where you want, say /home/myaccount/hadoop-x.x
add the following environment variables: JAVA_HOME, HADOOP_HOME (/home/myaccount/hadoop-x.x)
(if you'd like) add hadoop bin to your path: export PATH=$HADOOP_HOME/bin:$PATH
replace your Hadoop configuration files with those you got from the edge node. With Hadoop 2.5.2, that is the folder $HADOOP_HOME/etc/hadoop
also, I had to change the value of a couple of $JAVA_HOME entries defined in the conf files. To find them, use: grep -r "export.*JAVA_HOME"
Then run hadoop fs -ls /, which should list the root directory of the cluster's HDFS.
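Put together, the environment part of those steps looks roughly like this (a sketch assuming a Hadoop 2.5.2 tarball unpacked under /home/myaccount/hadoop-2.5.2 and that the edge node keeps its config under /etc/hadoop/conf; the hostname and paths are placeholders to adjust to your setup):

# in ~/.bashrc (or equivalent) on your computer
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64    # assumption: point this at your own JDK
export HADOOP_HOME=/home/myaccount/hadoop-2.5.2
export PATH=$HADOOP_HOME/bin:$PATH

# copy the cluster's configuration over your local one
scp -r myaccount@edge-node:/etc/hadoop/conf/* $HADOOP_HOME/etc/hadoop/

# sanity check: should list the root of the cluster's HDFS
hadoop fs -ls /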
Typically, if you have a multi-tenant cluster (which most Hadoop clusters are bound to be), then ideally no one other than administrators has access to the machines that are part of the cluster.
Developers set up their own "edge nodes". Edge nodes basically have the Hadoop libraries and the client configuration deployed to them (various XML files such as core-site.xml, mapred-site.xml and hdfs-site.xml, which tell the local installation where the NameNode, JobTracker, ZooKeeper, etc. are). But the edge node does not have any role as such in the cluster, i.e. no persistent Hadoop services run on it.
Now, in the case of a small development-environment kind of setup, you can use any one of the participating nodes of the cluster to run jobs or shell commands.
So based on your requirement the definition and placement of client varies.
I recommend this article.
"Client machines have Hadoop installed with all the cluster settings, but are neither a Master or a Slave. Instead, the role of the Client machine is to load data into the cluster, submit Map Reduce jobs describing how that data should be processed, and then retrieve or view the results of the job when its finished."

Configuring Hadoop, HBase and Hive Cluster

I am a newbie to Hadoop, HBase and Hive. I installed Hadoop, HBase and Hive in pseudo-distributed mode and everything works fine.
Now I am planning to set up a simple Hadoop cluster (5 nodes) with Hive, HBase and ZooKeeper. I've read several pieces of documentation and instructions before, but I could not find a good explanation for my question. I'm not sure where to run all the daemons. This is my consideration:
Node_1 (Master)
NameNode
JobTracker
HBase Master
ZooKeeper (Standalone node; managed by HBase)
Node_2 (Backup_Master)
SecondaryNameNode
Node_3 (Slave1)
DataNode1
TaskTracker1
RegionServer1
Node_4 (Slave2)
DataNode2
TaskTracker2
RegionServer2
Node_5 (Slave3)
DataNode3
TaskTracker3
RegionServer3
I know that in production it is recommended to run a ZooKeeper ensemble on an odd number of nodes (as a separate cluster). But for a simple cluster, is it OK to set up a standalone ZooKeeper node that runs on the master node?
Another question is regarding Hive: I know that Hive is a Hadoop client. Should I also install Hive on the master node? Does it make sense?
Thanks for all tips and comments!
Hakan
Note: I have just 5 machines to simulate a cluster.
For testing purposes, I believe you can set up ZooKeeper on the master node; I did install all of them on the same server.
What I do not understand from your question is why you installed Hadoop in pseudo-distributed mode if you have 5 machines in your cluster; it might be better to install it in fully distributed mode.
For Hive, it seems that you have to install it alongside Hadoop.
Hive uses Hadoop, which means:
you must have Hadoop in your path OR export HADOOP_HOME=<hadoop-install-dir>
@iTech: That's right. If you install Hive, you have to set the variable "HADOOP_HOME" to your Hadoop installation path. But that's not the problem. As I said, I worked before with Hadoop and Hive in pseudo-distributed mode.
The only problem is that I'm not sure where to run all the daemons in a 5-node cluster in fully distributed mode. I'm confused because I want to run a lot of tools together (Hadoop, HBase and Hive).
Hope that someone has a good tip...
If you are planning to use the described cluster for testing purposes, it is OK to have all your master daemons on the same server. Also, you can move the SecondaryNameNode role to Node_1, since the SecondaryNameNode is not a backup server for the NameNode; it is there to make checkpoints of your NameNode. So it makes sense to use Node_2 as another "worker" node in your cluster, or to host HiveServer2 and the Hive metastore there.
Hope this will help.
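For the daemon layout itself, most of the wiring is a couple of files on Node_1. A rough sketch, assuming a Hadoop 1.x conf/ layout and the hostnames node1..node5 standing in for Node_1..Node_5:

# conf/slaves: machines that run DataNode/TaskTracker (HBase's conf/regionservers lists the same hosts)
cat > $HADOOP_HOME/conf/slaves <<'EOF'
node3
node4
node5
EOF
# conf/masters: where the SecondaryNameNode runs (node1 if you follow the advice above, node2 otherwise)
echo node1 > $HADOOP_HOME/conf/masters
# hbase-env.sh: let HBase manage the single standalone ZooKeeper for this test cluster
export HBASE_MANAGES_ZK=true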
