How to connect to Impala when running a Hadoop cluster with edge nodes

I have installed a Hadoop cluster using Cloudera Manager, and an Impala daemon is currently running on every data node. The cluster sits behind gateway/edge nodes, and only gateway services are installed on the edge node (e.g. HttpFS, Hive gateway, Spark gateway, Oozie).
How can I connect to Impala from the gateway/edge node, given that all the Impala daemons run on the cluster's data nodes and no Impala service is exposed on the gateway/edge node?

You could install HAProxy on your edge node and have it forward client connections to the Impala daemons on the data nodes:
https://www.cloudera.com/documentation/enterprise/5-2-x/topics/impala_proxy.html
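As a rough illustration of the proxy approach, the sketch below renders an HAProxy `listen` stanza that forwards one port on the edge node to the Impala daemons. The host names, the stanza name and the edge-node port 25003 are placeholders, not values from the question; 21000 is the default impalad port for impala-shell (JDBC/ODBC clients use 21050 by default). Treat the generated text as a starting point and check it against the Cloudera page linked above.

```python
def haproxy_listen_block(name, listen_port, impalad_hosts, impalad_port=21000):
    """Render an HAProxy 'listen' stanza that forwards TCP traffic to impalad hosts.

    All arguments are illustrative; adjust ports and hosts to your cluster.
    """
    lines = [
        f"listen {name}",
        f"    bind *:{listen_port}",   # port clients connect to on the edge node
        "    mode tcp",                # Impala client protocols are plain TCP here
        "    balance leastconn",       # send new connections to the least busy daemon
    ]
    for index, host in enumerate(impalad_hosts, start=1):
        # 'check' enables HAProxy health checks for each backend daemon
        lines.append(f"    server impalad{index} {host}:{impalad_port} check")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical data node host names
    hosts = ["datanode1.example.com", "datanode2.example.com", "datanode3.example.com"]
    print(haproxy_listen_block("impala_shell", 25003, hosts))
```

With a stanza like this loaded into HAProxy on the edge node, a client would point at the edge node instead of a data node, for example `impala-shell -i <edge-node-host>:25003`.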

Related

In a Hadoop cluster, should Hive LLAP daemons work on datanodes or on dedicated nodes?

In a Hadoop cluster, should Hive LLAP daemons run on the DataNodes or on dedicated nodes (as the only service on the node)? Impala daemons are recommended to be installed on every DataNode, but this article suggests installing LLAP daemons on dedicated nodes.
LLAP instances have dedicated Hive daemons for faster access. While you could share resources between regular NodeManagers and LLAP daemons, you would get better performance if the whole node is dedicated to LLAP.
Note that you still need to run a NodeManager on the node, because LLAP daemons run as YARN containers rather than as standalone OS services. You can also run a DataNode on the machine if it has storage that you'd like to use.
TL;DR: run LLAP on a few dedicated nodes for better performance.

Hadoop cluster and client connection

I have a Hadoop cluster. Now I want to install Pig and Hive on another machine as a client. The client machine will not be part of the cluster. Is that possible? If so, how do I connect the client machine to the cluster?
First of all, a Hadoop cluster has a master node (NameNode) plus slave nodes (DataNodes).
The other piece is the client node.
A Hadoop cluster works like this:
The NameNode and DataNodes form the Hadoop cluster, and the client submits jobs to it.
To do that, the client needs the same Hadoop distribution and the same configuration as the NameNode.
Only then will the client know on which node the JobTracker is running and the address of the NameNode for accessing HDFS data.
See Link1 and Link2 for client configuration.
As for your question:
Once the Hadoop cluster is fully configured (master + slaves + client), do the following:
Install Hive and Pig on the master node
Install Hive and Pig on the client node
Start writing Pig/Hive code on the client node.
Feel free to comment if you have any doubts.
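To make the point about shared configuration concrete, here is a minimal sketch that reads Hadoop-style `*-site.xml` content and reports which NameNode and JobTracker a client would talk to. The property names `fs.default.name` and `mapred.job.tracker` are the Hadoop 1.x keys (newer releases use `fs.defaultFS`); the host names and ports in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_hadoop_conf(xml_text):
    """Parse Hadoop <configuration> XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Hypothetical contents of the client's core-site.xml and mapred-site.xml,
# copied from the cluster so that both sides agree.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""

SAMPLE_MAPRED_SITE = """<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>namenode.example.com:8021</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    conf = {}
    conf.update(read_hadoop_conf(SAMPLE_CORE_SITE))
    conf.update(read_hadoop_conf(SAMPLE_MAPRED_SITE))
    print("HDFS endpoint:", conf.get("fs.default.name"))
    print("JobTracker:   ", conf.get("mapred.job.tracker"))
```

If the values printed on the client differ from those on the master, the client is not pointing at the cluster.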

In an HBase cluster, is it advisable to have a ZooKeeper quorum peer on a node that runs a RegionServer?

I have a 4-node Hadoop cluster:
namenode.example.com
datanode1.example.com
datanode2.example.com
datanode3.example.com
To set up an HBase cluster on top of this, I intend to use namenode.example.com as the HBase master and the 3 DataNodes as the region servers.
A fully distributed HBase production setup needs more than one ZooKeeper server, and an odd number is recommended.
So what would be the disadvantages of colocating the ZooKeeper servers with the region servers?
References
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hbase_cluster_deploy.html
http://hbase.apache.org/0.94/book/zookeeper.html
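The odd-number recommendation follows from ZooKeeper needing a strict majority of its servers to be up. The small sketch below works through that arithmetic; the function names are only for illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

An ensemble of 4 tolerates the same single failure as an ensemble of 3, so the extra server adds no fault tolerance. With ZooKeeper on the 3 data nodes described above, one node can be lost before the ensemble stops serving.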

Ganglia fails to communicate with Apache HBase

I installed Ganglia to monitor the HBase cluster. I'm using ganglia-3.3.0.
Hadoop version: hadoop-1.1.2
HBase version : hbase-0.94.8
My Hadoop cluster consists of 1 master node and 2 slave nodes.
The Ganglia gmetad server is configured on the master node.
I changed the hbase/conf/hadoop-metrics.properties file:
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=hostname_of_ganglia_server:8649
I started the gmond service on the master as well as on the slaves.
I get the basic metrics from the cluster (CPU, disk, load, ...),
but I'm not getting any HBase metrics from the cluster.
The mistake was in the gmond.conf file. When I commented out the following values, I got the HBase metrics in Ganglia:
mcast_join = 239.2.11.71
bind = 239.2.11.71
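For anyone applying the same fix across several hosts, here is a small sketch that comments out those two settings in gmond.conf text. It assumes the settings sit on their own lines and that `#` comments are acceptable in your gmond.conf; the sample text only mimics the usual channel layout and is not taken from the poster's file. Whether disabling these values is right for your network is a separate question.

```python
import re

# Matches lines whose key is exactly mcast_join or bind (not e.g. bind_hostname)
_TARGET = re.compile(r"^\s*(mcast_join|bind)\s*=")


def comment_out_multicast(conf_text):
    """Return conf_text with mcast_join and bind settings commented out."""
    result = []
    for line in conf_text.splitlines():
        if _TARGET.match(line):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}# {line.lstrip()}"
        result.append(line)
    return "\n".join(result)


# Illustrative gmond.conf fragment
SAMPLE_GMOND = """udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}"""

if __name__ == "__main__":
    print(comment_out_multicast(SAMPLE_GMOND))
```

Restart gmond after changing the file so that the new settings take effect.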

Hadoop components to nodes mapping - what components should be installed where

I am considering the following Hadoop services for setting up a cluster using HDP 2.1:
- HDFS
- YARN
- MapReduce2
- Tez
- Hive
- WebHCat
- Ganglia
- Nagios
- ZooKeeper
There are 3 node types that I can think of:
- NameNodes (e.g. primary, secondary)
- Application nodes (from where I will access the Hive service most often, and where I will also copy code repositories and any other code artifacts)
- Data nodes (the workhorses of the cluster)
Given the above, I know of these best practices and common denominators:
- ZooKeeper services should be running on at least 3 data nodes
- The DataNode service should be running on all data nodes
- The Ganglia monitor should be running on all data nodes
- The NameNode service should be running on the name nodes
- NodeManager should be installed on all nodes that contain the DataNode component
This still leaves lots of open questions, e.g.:
Which is the ideal node for installing the many servers needed, e.g. Hive Server, App Timeline Server, WebHCat Server, Nagios Server, Ganglia Server, MySQL server? Is it the application nodes? Should each get its own node? Should we have a separate 'utilities' node?
Is there some criterion for choosing where ZooKeeper should be installed?
I think the more generic question is: is there a table of "Hadoop components to nodes mapping", essentially what components should be installed where?
I am seeking advice, insight, links or documents on this topic.

Resources