Hadoop components-to-nodes mapping - what components should be installed where

I am considering the following Hadoop services for setting up a cluster using HDP 2.1:
- HDFS
- YARN
- MapReduce2
- Tez
- Hive
- WebHCat
- Ganglia
- Nagios
- ZooKeeper
There are 3 node types that I can think of:
- Name nodes (e.g. primary, secondary)
- Application nodes (from where I will access the Hive service most often, and where I will also keep code repositories and any other code artifacts)
- Data nodes (the workhorses of the cluster)
Given the above, I know of these best practices and common denominators:
- ZooKeeper should be running on at least 3 data nodes
- The DataNode service should be running on all data nodes
- The Ganglia monitor should be running on all data nodes
- The NameNode service should be running on the name nodes
- The NodeManager should be installed on all nodes containing the DataNode component
This still leaves lots of open questions, for example:
- Which is the ideal node to install the various servers needed (e.g. Hive Server, App Timeline Server, WebHCat Server, Nagios Server, Ganglia Server, MySQL server)? Is it the application nodes? Should each get its own node? Should we have a separate 'utilities' node?
- Is there some criterion for choosing where ZooKeeper should be installed?
I think the more generic question is: is there a table of Hadoop components-to-nodes mapping, i.e. essentially what components should be installed where?
Seeking advice/insight/links or documents on this topic.

Related

How to connect to Impala when running Hadoop Cluster with Edge Nodes

I have installed a Hadoop cluster using Cloudera Manager, and currently the Impala daemon is running on all the data nodes. The cluster is behind gateway/edge nodes, and only gateway services are installed on the edge node (e.g. HttpFS, Hive gateway, Spark gateway, Oozie).
I am wondering how I can connect to Impala from the gateway/edge node, since all the Impala daemons are running on the cluster's data nodes and no service is exposed on the gateway/edge node.
You could install HAProxy on your edge node as a proxy/load balancer for the Impala daemons:
https://www.cloudera.com/documentation/enterprise/5-2-x/topics/impala_proxy.html
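Following the Cloudera proxy documentation linked above, a minimal haproxy.cfg section might look like the sketch below. The hostnames are hypothetical examples, and 21050 is the default Impala JDBC port; adjust both to your environment.

```
# Hypothetical haproxy.cfg fragment: TCP-balance JDBC connections from the
# edge node across the Impala daemons running on the data nodes.
listen impala-jdbc
    bind *:21050
    mode tcp
    option tcplog
    balance leastconn
    server impalad1 datanode1.example.com:21050 check
    server impalad2 datanode2.example.com:21050 check
    server impalad3 datanode3.example.com:21050 check
```

Clients would then point their JDBC connection string at the edge node's own hostname and port 21050, and HAProxy forwards each connection to one of the daemons.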

Hadoop cluster and client connection

I have a Hadoop cluster. Now I want to install Pig and Hive on another machine as a client. The client machine will not be part of the cluster, so is this possible? If it is, how do I connect that client machine to the cluster?
First of all, if you have a Hadoop cluster then you have a master node (NameNode) plus slave nodes (DataNodes).
A client node is another, separate kind of node.
The working of a Hadoop cluster is as follows: the NameNode and DataNodes form the Hadoop cluster, and the client submits jobs to the NameNode.
To achieve this, the client must have the same copy of the Hadoop distribution and configuration that is present on the NameNode.
Only then will the client know which node the JobTracker is running on, and the IP of the NameNode for accessing HDFS data.
Go to Link1 Link2 for client configuration.
As for your question: after completing the Hadoop cluster configuration (master + slaves + client), you need to do the following:
- Install Hive and Pig on the master node
- Install Hive and Pig on the client node
Now Start Coding pig/hive on client node.
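As a rough sketch of the client-node side of these steps - all install paths and version numbers below are made-up examples, not prescribed locations:

```shell
# Hypothetical client-node environment for Hive and Pig. Assumes the
# tarballs were already unpacked under /opt, and the cluster's *-site.xml
# configuration files were copied from the master into $HADOOP_HOME/etc/hadoop.
export HADOOP_HOME=/opt/hadoop-2.5.2
export HIVE_HOME=/opt/apache-hive-1.2.1-bin
export PIG_HOME=/opt/pig-0.15.0
export PATH="$HIVE_HOME/bin:$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH"

# With the cluster configuration in place, hive and pig launched here
# submit work to the cluster instead of running locally, e.g.:
#   hive -e 'SHOW DATABASES;'
#   pig -x mapreduce myscript.pig
echo "$PATH" | grep -q "$HIVE_HOME/bin" && echo "client PATH ready"
```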
Feel free to comment if you have any doubts!

Should the Falcon Prism be installed on separate machine than the existing clusters?

I am trying to understand the setup for a Falcon distributed cluster.
I have Cluster A and Cluster B, both with their own Falcon servers (and NameNode, Oozie, Hive, etc.). Now, to install Prism, what would be the best idea? Should I install it on one of the clusters (on a different node than the Falcon server) or on a separate machine? If Prism is set up on a third, single-node cluster, should that cluster also run components like the NameNode, Oozie, etc.?
Prism has a config store where the entities are stored. The config store will typically be on HDFS and hence needs a Hadoop client.
So yes, the third cluster would need HDFS, a NameNode, etc. Oozie is not necessary.

In an HBase cluster, is it advisable to have a ZooKeeper quorum peer on a node that runs a RegionServer?

I have a 4 node hadoop cluster.
namenode.example.com
datanode1.example.com
datanode2.example.com
datanode3.example.com
To set up an HBase cluster on top of this, I intend to use namenode.example.com as the HBase master and the 3 data nodes as region servers.
A fully distributed HBase production setup needs more than one ZooKeeper server, and an odd number is recommended.
So, what would be the disadvantages of having the ZooKeeper servers colocated with the RegionServers?
References
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hbase_cluster_deploy.html
http://hbase.apache.org/0.94/book/zookeeper.html
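For reference, a 3-peer ensemble on those data nodes would be described by a zoo.cfg along these lines - a sketch with default ports and an example dataDir. The usual caveat when colocating with RegionServers is to put the ZooKeeper transaction log on its own disk so RegionServer I/O cannot starve it:

```
# Hypothetical zoo.cfg for a 3-server quorum colocated with the region servers
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper      # ideally on a dedicated disk
clientPort=2181
server.1=datanode1.example.com:2888:3888
server.2=datanode2.example.com:2888:3888
server.3=datanode3.example.com:2888:3888
```

Each peer additionally needs a myid file under dataDir containing its own server number (1, 2, or 3).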

Hadoop Client Node Configuration

Assume there is a Hadoop cluster that has 20 machines. Of those 20 machines, 18 are slaves, machine 19 is for the NameNode, and machine 20 is for the JobTracker.
Now, I know that the Hadoop software has to be installed on all those 20 machines.
But my question is: which machine is involved in loading a file xyz.txt into the Hadoop cluster? Is the client machine a separate machine? Do we need to install the Hadoop software on that client machine as well? How does the client machine identify the Hadoop cluster?
I am new to Hadoop, so this is what I have understood:
If your data upload is not an actual service of the cluster (which would be running on an edge node of the cluster), then you can configure your own computer to work as an edge node.
An edge node does not need to be known by the cluster (except for security purposes), as it neither stores data nor computes jobs. This is basically what it means to be an edge node: it is connected to the Hadoop cluster but does not participate in it.
In case it can help someone, here is what I have done to connect to a cluster that I don't administer:
- Get an account on the cluster, say myaccount
- Create an account on your computer with the same name: myaccount
- Configure your computer to access the cluster machines (ssh without passphrase, registered IP, ...)
- Get the Hadoop configuration files from an edge node of the cluster
- Get a Hadoop distribution (e.g. from here)
- Uncompress it where you want, say /home/myaccount/hadoop-x.x
- Add the following environment variables: JAVA_HOME and HADOOP_HOME (/home/myaccount/hadoop-x.x)
- (If you'd like) add the Hadoop bin directory to your path: export PATH=$HADOOP_HOME/bin:$PATH
- Replace your Hadoop configuration files with those you got from the edge node. With Hadoop 2.5.2, they live in the folder $HADOOP_HOME/etc/hadoop
- Also, I had to change the value of a couple of $JAVA_HOME settings defined in the conf files. To find them, use: grep -r "export.*JAVA_HOME"
Then run hadoop fs -ls /, which should list the root directory of the cluster's HDFS.
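The environment-variable steps above can be sketched as follows; the paths are examples only, so substitute your own Java and Hadoop locations:

```shell
# Hypothetical client environment mirroring the steps above.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64   # example path, use your JDK
export HADOOP_HOME=/home/myaccount/hadoop-2.5.2      # where the distro was uncompressed
export PATH="$HADOOP_HOME/bin:$PATH"

# The configs copied from the edge node go into $HADOOP_HOME/etc/hadoop;
# once they are in place, this should list the root of the cluster's HDFS:
#   hadoop fs -ls /
echo "$PATH" | grep -q "$HADOOP_HOME/bin" && echo "PATH configured"
```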
Typically, if you have a multi-tenant cluster (which most Hadoop clusters are bound to be), then ideally nobody other than the administrators has access to the machines that are part of the cluster.
Developers set up their own "edge nodes". Edge nodes basically have the Hadoop libraries installed and the client configuration deployed to them (the various XML files - core-site.xml, mapred-site.xml, hdfs-site.xml - which tell the local installation where the NameNode, JobTracker, ZooKeeper, etc. are). But the edge node does not have any role as such in the cluster, i.e. no persistent Hadoop services are running on it.
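To make the "client configuration" concrete: the key entry that tells an edge node where the cluster is lives in core-site.xml, along these lines (the host is a placeholder, and 8020 is a common NameNode RPC port; your cluster's values may differ):

```xml
<!-- Hypothetical core-site.xml fragment on an edge node: the client uses
     fs.defaultFS to locate the NameNode for all HDFS operations. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```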
In the case of a small development-environment setup, you can use any one of the participating nodes of the cluster to run jobs or shell commands.
So, based on your requirements, the definition and placement of the client varies.
I recommend this article.
"Client machines have Hadoop installed with all the cluster settings, but are neither a Master or a Slave. Instead, the role of the Client machine is to load data into the cluster, submit Map Reduce jobs describing how that data should be processed, and then retrieve or view the results of the job when its finished."
