Hadoop Multinode Cluster installation via Ambari

For setting up a multi-node Hadoop cluster through Ambari, do we require the same operating system on both hosts, or will different ones work too? For example, one of my hosts has CentOS 7 and the other has CentOS 6, so will the setup be successful or will it throw an error?

I think it does not depend on the OS being CentOS 7 or CentOS 6, because Hadoop is deployed as a master machine plus slave machines; once you have deployed the master node, the other nodes can join the Hadoop cluster directly. Also, YARN is responsible for the task scheduling.

Ambari supports clusters that consist of heterogeneous OSs. Just make sure to provide valid repo URLs for the OSs in use when deploying the cluster.
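For example, the repo base URLs can be set per OS family either on the Select Stack page of the install wizard or through Ambari's REST API. A rough sketch with curl follows; the stack version, repo IDs and mirror URLs are placeholders for whatever HDP release you actually deploy:

    # Repo URL for the CentOS 6 hosts (stack version, repo ID and URL are placeholders)
    curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
      -d '{"Repositories": {"base_url": "http://my-mirror/HDP/centos6/2.x/updates/2.6.0.3", "verify_base_url": true}}' \
      http://ambari-server:8080/api/v1/stacks/HDP/versions/2.6/operating_systems/redhat6/repositories/HDP-2.6

    # ...and the matching repo URL for the CentOS 7 hosts
    curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
      -d '{"Repositories": {"base_url": "http://my-mirror/HDP/centos7/2.x/updates/2.6.0.3", "verify_base_url": true}}' \
      http://ambari-server:8080/api/v1/stacks/HDP/versions/2.6/operating_systems/redhat7/repositories/HDP-2.6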

Related

Install Hadoop on OpenStack

I'm new to big data, and I have a question about the installation of Hadoop.
Currently I use an image on VirtualBox, but I would like to create a cluster on OpenStack. At first I thought I just needed to instantiate a Hadoop image on OpenStack, or launch several instances and use the Hadoop Docker image.
But I found several examples using OpenStack Sahara. Given that I already share an OpenStack deployment with several people, is it possible to create a Hadoop cluster without going through OpenStack Sahara? Or is that not recommended?
Not sure about OpenStack Sahara, but you can certainly create a Hadoop cluster using VM nodes on OpenStack.
Single node installation guide
http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#
Yes, it's possible to create a Hadoop cluster on an OpenStack cloud without using OpenStack Sahara. You can launch 3 virtual machines on OpenStack and assign a floating IP to each of them.
One can be used as the master and the other 2 as slaves. You can follow the Hadoop multi-node installation steps on these virtual machines and connect them using the SSH configuration described in any Hadoop multi-node setup guide.
You can also write an automated shell script for launching Hadoop on OpenStack.
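As a rough sketch, the VM part can be scripted with the OpenStack CLI; the image, flavor, network, key and host names below are placeholders for whatever exists in your project:

    # Launch one master and two slave VMs (image/flavor/network/key names are placeholders)
    for NAME in hadoop-master hadoop-slave1 hadoop-slave2; do
      openstack server create --image centos7 --flavor m1.large \
        --network private --key-name mykey "$NAME"
    done

    # Allocate a floating IP from the external network and attach it to each VM
    for NAME in hadoop-master hadoop-slave1 hadoop-slave2; do
      FIP=$(openstack floating ip create public -f value -c floating_ip_address)
      openstack server add floating ip "$NAME" "$FIP"
    done

From there the usual multi-node Hadoop steps (Java, SSH keys, config files, start scripts) apply on the three VMs.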

Where to install Java on a multi-node Hadoop cluster?

In a multi-node Hadoop cluster with multiple slave nodes, one master node, and one client node, on which of these machines does Java need to be installed?
Also, does Hadoop need to be installed only on the client node? I get confused after going through sites that say we first need to install Java, but don't mention on which node it should be installed.
Java is a prerequisite to run Hadoop. You need to install Java on all the machines, including the client.
As for the client configuration: there is no need to install Hadoop on the client machine; it is only used to communicate with the Hadoop cluster.
Check the links below for more:
Hadoop Client Node Configuration
https://pravinchavan.wordpress.com/2013/06/18/submitting-hadoop-job-from-client-machine/
Java is the prerequisite to run Hadoop. It should be installed on all master and slave nodes.
You can refer to a Hadoop multi-node cluster setup document for more details.
A JDK should be installed on all the nodes, as it is the primary requirement for Hadoop to work.
Make sure you install the same version of Java on all the nodes.
Oracle Java is preferred over OpenJDK.
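A minimal sketch of doing this from one machine over SSH, assuming CentOS hosts named master, slave1, slave2 and client, and the OpenJDK package (swap in an Oracle JDK RPM if you prefer it):

    # Install the same JDK on every node (hostnames and package name are placeholders)
    for HOST in master slave1 slave2 client; do
      ssh "$HOST" "sudo yum install -y java-1.8.0-openjdk-devel"
    done

    # Verify that every node reports the same Java version
    for HOST in master slave1 slave2 client; do
      echo -n "$HOST: "
      ssh "$HOST" "java -version 2>&1 | head -1"
    done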

Multi-node Hadoop cluster installation

Sorry if my question appears to be naïve. We are planning to use CDH 5.3.0 or 5.4.0. We want to implement a multi-node cluster.
The example multi-node installations that I have seen/read on different blogs/resources have master and slaves on different hosts.
However, we are constrained by the number of hosts. We have only 2 powerful hosts (32 cores, 400+ GB RAM each), so if we decide to have the master on one and a slave on the other, we will end up with only one slave. My questions are:
Is it possible to have master and slave on the same host?
Can I have more than one slave node on a single host?
Also, does one need to pay to use Cloudera Manager, or is it open source like the rest of the components?
If you can point me in the direction of some resource that would help me understand the above scenarios, it would be helpful.
Thanks for your help.
Regards,
V
Old question, but it still lacks a correct answer:
Yes, it is possible to install master & worker services on a single host,
e.g. HDFS (NameNode and DataNode). You can even install a full Cloudera or Hortonworks distribution with ALL services on a single host if it is powerful enough, but I would only recommend that for POCs or test cases.
If you use Cloudera or Hortonworks without virtualization, it is not possible to run multiple instances of the SAME worker service, e.g. the DataNode, on the same host: 1 host, 1 worker instance. Anything else would not make sense.
Cloudera is a package of multiple open-source projects (Hadoop, Spark, ...) plus closed-source parts such as Cloudera Manager and other enterprise features. But everything you need is free, even for commercial use, under the community license.
Right now (2017), Cloudera Navigator is the only big feature that is not part of the community edition.
Yes, you can configure both the NameNode and a DataNode on a single node.
You cannot have more than one DataNode on a single machine.
Cloudera is an open-source Hadoop distribution.
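To illustrate the co-location point with plain Apache Hadoop commands (in CDH the same thing is done by assigning both roles to the host in Cloudera Manager, so the commands below are only a sketch of the idea):

    # Start master and worker daemons on the same host (Hadoop 2.x layout assumed)
    $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode        # HDFS master role
    $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode        # HDFS worker role
    $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager   # YARN master role
    $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager       # YARN worker role

    # jps should now list NameNode, DataNode, ResourceManager and NodeManager side by side
    jps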

How can I set up a Hadoop multi-node cluster on my laptop?

I am using CDH3 on CentOS 5.6 in a VMware virtual machine and am facing many problems while configuring the multi-node cluster.
Can anybody provide complete steps to configure a multi-node cluster?
Configuring a distributed cluster by yourself is a bit of a headache; it can be done, but it isn't worth it if you're just testing and doing some research.
Since you're using Cloudera, you should take a look at Cloudera Manager, which automates the installation and management of Hadoop clusters and should play nice with VMs as well (as long as you have some serious RAM):
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-8-2/Cloudera-Manager-Introduction/cmi_primer.html
This is the installation guide:
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-8-2/Cloudera-Manager-Installation-Guide/Cloudera-Manager-Installation-Guide.html
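The classic "Path A" install boils down to downloading and running the installer binary on the host that will run Cloudera Manager; the archive URL below follows Cloudera's historical layout for CM4 and may have moved since:

    # Download and run the Cloudera Manager installer (archive URL is historical and may have changed)
    wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
    chmod +x cloudera-manager-installer.bin
    sudo ./cloudera-manager-installer.bin
    # Then open http://<this-host>:7180 in a browser and let the wizard add the other VMs as hosts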

Hadoop Client Node Configuration

Assume there is a Hadoop cluster of 20 machines. Out of those 20 machines, 18 are slaves, machine 19 is for the NameNode, and machine 20 is for the JobTracker.
Now I know that the Hadoop software has to be installed on all those 20 machines.
But my question is: which machine is used to load a file xyz.txt into the Hadoop cluster? Is that client machine a separate machine? Do we need to install the Hadoop software on that client machine as well? How does the client machine identify the Hadoop cluster?
I am new to Hadoop, so this is what I understood:
If your data upload is not an actual service of the cluster (which would have to run on an edge node of the cluster), then you can configure your own computer to work as an edge node.
An edge node doesn't need to be known by the cluster (except for security purposes), as it neither stores data nor computes jobs. This is basically what it means to be an edge node: it is connected to the Hadoop cluster but does not participate in it.
In case it can help someone, here is what I have done to connect to a cluster that I don't administer:
get an account on the cluster, say myaccount
create an account on your computer with the same name: myaccount
configure your computer to access the cluster machines (ssh without passphrase, registered IP, ...)
get the Hadoop configuration files from an edge node of the cluster
get a Hadoop distribution (e.g. from the Apache Hadoop download page)
uncompress it where you want, say /home/myaccount/hadoop-x.x
add the following environment variables: JAVA_HOME, HADOOP_HOME (/home/myaccount/hadoop-x.x)
(if you'd like) add the hadoop bin to your path: export PATH=$HADOOP_HOME/bin:$PATH
replace your Hadoop configuration files with those you got from the edge node; with Hadoop 2.5.2, that is the folder $HADOOP_HOME/etc/hadoop
also, I had to change the value of a couple of JAVA_HOME exports defined in the conf files; to find them, use: grep -r "export.*JAVA_HOME"
Then run hadoop fs -ls /, which should list the root directory of the cluster HDFS.
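Pulled together, the environment part of those steps looks roughly like this on the local machine (the JDK path, Hadoop version and the directory holding the copied cluster config are placeholders):

    # Environment for the locally unpacked Hadoop distribution (paths and version are placeholders)
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
    export HADOOP_HOME=/home/myaccount/hadoop-2.5.2
    export PATH=$HADOOP_HOME/bin:$PATH

    # Overwrite the local config with the *-site.xml files copied from the cluster's edge node
    cp /home/myaccount/cluster-conf/*.xml $HADOOP_HOME/etc/hadoop/

    # Smoke test: list the root of the cluster's HDFS
    hadoop fs -ls /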
Typically, if you have a multi-tenant cluster (which most Hadoop clusters are bound to be), then ideally no one other than the administrators has access to the machines that are part of the cluster.
Developers set up their own "edge nodes". Edge nodes basically have the Hadoop libraries and the client configuration deployed to them (various XML files such as core-site.xml, mapred-site.xml, and hdfs-site.xml, which tell the local installation where the NameNode, JobTracker, ZooKeeper, etc. are); a minimal example of such client-side configuration is sketched after this answer. But the edge node does not have any role as such in the cluster, i.e. no persistent Hadoop services run on this node.
Now, in the case of a small development-environment setup, you can use any one of the participating nodes of the cluster to run jobs or shell commands.
So, based on your requirements, the definition and placement of the client varies.
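A minimal sketch of that client-side configuration, using the MR1/JobTracker property names that match the cluster described in the question (hostnames and ports are placeholders; in practice you would copy the cluster's own *-site.xml files rather than write them by hand, and a YARN cluster would use fs.defaultFS plus the ResourceManager address in yarn-site.xml instead):

    <!-- core-site.xml on the client/edge node: where the NameNode lives -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode-host:8020</value>
      </property>
    </configuration>

    <!-- mapred-site.xml: where the JobTracker lives -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker-host:8021</value>
      </property>
    </configuration>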
I recommend this article.
"Client machines have Hadoop installed with all the cluster settings, but are neither a Master or a Slave. Instead, the role of the Client machine is to load data into the cluster, submit Map Reduce jobs describing how that data should be processed, and then retrieve or view the results of the job when its finished."
