Set up HiveServer2 and Hive metastore on separate nodes - hadoop

Is it possible to set up the Hive metastore and HiveServer2 services on separate nodes? I know that HDP Ambari forces you to set the two up on the same node, along with WebHCat, I believe, but what about other vendors, such as Cloudera and others?

HiveServer2 and the Hive metastore server are independent daemons that can run on different nodes; they communicate over a Thrift-based connection. The Cloudera and MapR distributions offer this option, and I think Hortonworks should as well.
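As a sketch of how the split works (the hostname here is a placeholder): on the HiveServer2 node, hive-site.xml points at the remote metastore's Thrift endpoint, while the metastore daemon itself runs on the other host.

```xml
<!-- hive-site.xml on the HiveServer2 node; "metastore-host" is a placeholder -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```

With hive.metastore.uris set, HiveServer2 talks Thrift to that host instead of starting an embedded metastore; 9083 is the conventional metastore port.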

Related

How to query kerberos enabled hbase using apache drill?

We have a Kerberized Hadoop cluster where HBase is running. Apache Drill runs in distributed mode in another cluster. Now we need to query the Kerberos-enabled HBase from Apache Drill, using the web UI. The Hadoop cluster actually runs in AWS, and HBase uses S3 as storage. Please help me with the steps to achieve successful queries against HBase.
Apache Drill version: 1.16.0, Hadoop version: 2
Usually, to query HBase, we run kinit with the keytab manually and then open the HBase shell on the Hadoop cluster. We wanted to use Drill to query in a SQL fashion, for ease of use and better readability.
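For the HBase side of this, Drill's storage-plugin JSON (edited in the web UI under Storage) carries the HBase client properties. A minimal sketch; the quorum host and Kerberos principals are assumptions you would replace with your cluster's values:

```json
{
  "type": "hbase",
  "config": {
    "hbase.zookeeper.quorum": "zk-host.example.com",
    "hbase.zookeeper.property.clientPort": "2181",
    "hbase.security.authentication": "kerberos",
    "hbase.master.kerberos.principal": "hbase/_HOST@EXAMPLE.COM",
    "hbase.regionserver.kerberos.principal": "hbase/_HOST@EXAMPLE.COM"
  },
  "size.calculator.enabled": false,
  "enabled": true
}
```

Note that the Drillbits themselves still need a valid Kerberos login (e.g. a keytab configured for the Drill service) before these client properties can take effect.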

In a Hadoop cluster, should Hive be installed on all nodes? Installing Pig

I am new to Hadoop / Pig and I have just started reading the docs.
There are lots of blogs on installing Hadoop in cluster mode.
I know that Pig runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes.
Should I also install Pig on all the cluster nodes, or only on the master node?
You would want to install the Hive Metastore and HiveServer on two different nodes. By default, Hive uses the Derby database, but most people choose MySQL instead, so there will also be a MySQL server daemon.
So, to keep it simple:
Install HiveServer and the WebHCat server on one node.
Install the Hive Metastore and the MySQL server on another node.
This is the best practice. If you have any other doubts, just ask!
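On the metastore node, the MySQL backing store is wired up in hive-site.xml; a sketch with placeholder host and credentials:

```xml
<!-- hive-site.xml on the metastore node; host, user and password are placeholders -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```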
I cannot tell whether the question is about Hive or Pig, but either way there's a difference between clients and servers.
For Hive, the master services are the Metastore and HiveServer2. You can install these daemons on the same server to reduce network traffic between the metastore and the Hive query compiler. You only need one client to communicate with those masters.
Pig communicates directly with YARN and HDFS (and optionally Hive, if you use HCatalog). Again, it's only a client, so only one host needs it.
It is generally preferred to have a dedicated set of machines for Hive and for the RDBMS backing the metastore (MySQL and Postgres being the more popular options).
You also don't need to "install Pig on the cluster". For example, I could grab the Hadoop XML configs, download Pig locally, and run Pig code against the YARN cluster from any outside computer (the same applies to Spark).
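That "run Pig from outside the cluster" workflow looks roughly like this; the config path is an assumption, and the actual Pig invocations are shown as comments:

```shell
# Point the Pig/Hadoop clients at the cluster's config files
# (core-site.xml, hdfs-site.xml, etc. copied from a cluster node).
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PIG_CLASSPATH="$HADOOP_CONF_DIR"
echo "Using Hadoop config from: $HADOOP_CONF_DIR"

# With the configs in place, Pig submits jobs to the remote cluster:
#   pig -x mapreduce myscript.pig
# or interactively:
#   pig
```

PIG_CLASSPATH is how Pig picks up the cluster configuration; nothing cluster-side needs a Pig install.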

Hadoop cluster and client connection

I have a Hadoop cluster. Now I want to install Pig and Hive on other machines as clients. The client machines will not be part of the cluster, so is that possible? If so, how do I connect a client machine to the cluster?
First of all, if you have a Hadoop cluster then you have a master node (NameNode) plus slave nodes (DataNodes).
In addition to these there is the client node.
The working of a Hadoop cluster is: the NameNode and DataNodes form the cluster, and the client submits jobs to the NameNode.
To achieve this, the client needs the same copy of the Hadoop distribution and configuration that is present on the NameNode.
Only then will the client know which node the JobTracker is running on, and the IP of the NameNode for accessing HDFS data.
See Link1 and Link2 for client configuration.
As for your question: after completing the Hadoop cluster configuration (master + slaves + client), you need to do the following:
Install Hive and Pig on the master node.
Install Hive and Pig on the client node.
Now start writing Pig/Hive code on the client node.
Feel free to comment if you have any doubts!
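The client-node steps above boil down to copying the cluster's configuration and verifying connectivity; a sketch, with "master-node" and the paths as placeholders, and the cluster-side checks shown as comments:

```shell
# Copy the cluster configuration from the master to the client
# ("master-node" is a placeholder for your NameNode host).
mkdir -p "$HOME/hadoop-conf"
# scp master-node:/etc/hadoop/conf/*.xml "$HOME/hadoop-conf/"
export HADOOP_CONF_DIR="$HOME/hadoop-conf"
echo "Client configured against: $HADOOP_CONF_DIR"

# Sanity checks once the real configs are in place:
#   hdfs dfs -ls /          # can we reach HDFS?
#   pig -e 'fs -ls /'       # does Pig see the same cluster?
#   hive -e 'show tables;'  # does Hive reach the metastore?
```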

Should the Falcon Prism be installed on separate machine than the existing clusters?

I am trying to understand setup for a Falcon Distributed Cluster.
I have Cluster A and Cluster B, both with their own Falcon servers (and NameNode, Oozie, Hive, etc.). Now, to install Prism, what would be the best approach? Should I install it on one of the clusters (on a different node than the Falcon server) or on a separate machine? If Prism is set up on a third cluster (single node), should it also run components like the NameNode, Oozie, etc.?
Prism has a config store where the entities are kept. The config store is typically on HDFS, so Prism needs a Hadoop client.
So yes, the third cluster would need HDFS, a NameNode, etc. Oozie is not necessary.

Configuring Hadoop, HBase and Hive Cluster

I am a newbie to Hadoop, HBase, and Hive. I installed Hadoop, HBase, and Hive in pseudo-distributed mode, and everything works fine.
Now I am planning to set up a simple Hadoop cluster (5 nodes) with Hive, HBase, and ZooKeeper. I've read several pieces of documentation and instructions, but I could not find a good answer to my question: I'm not sure where to run all the daemons. This is my current plan:
Node_1 (Master)
NameNode
JobTracker
HBase Master
ZooKeeper (Standalone node; managed by HBase)
Node_2 (Backup_Master)
SecondaryNameNode
Node_3 (Slave1)
DataNode1
TaskTracker1
RegionServer1
Node_4 (Slave2)
DataNode2
TaskTracker2
RegionServer2
Node_5 (Slave3)
DataNode3
TaskTracker3
RegionServer3
I know that in production it is recommended to run a ZooKeeper ensemble on an odd number of nodes (as a separate cluster). But for a simple cluster, is it OK to set up a standalone ZooKeeper node that runs on the master node?
Another question is regarding Hive: I know that Hive is a Hadoop client. Should I also install Hive on the master node? Does that make sense?
Thanks for all tips and comments!
Hakan
Note: I have just 5 machines to simulate a cluster.
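For a Hadoop 1.x layout like the one above, the role assignment is typically recorded in the conf/masters and conf/slaves files on the master node. A minimal sketch, using the node names above as placeholders for real hostnames:

```
# conf/slaves -- hosts that run a DataNode and TaskTracker
node_3
node_4
node_5

# conf/masters -- host that runs the SecondaryNameNode
node_2
```

Despite its name, conf/masters lists the SecondaryNameNode host, not the NameNode; the NameNode is whichever machine you start the daemons from.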
For testing purposes, I believe you can set up ZooKeeper on the master node; I installed all of them on the same server.
What I don't understand from your question is why you installed Hadoop in pseudo-distributed mode if you have 5 machines in your cluster; it would be better to install it in fully distributed mode.
For Hive, it seems that you have to install it alongside Hadoop.
Hive uses Hadoop, which means:
you must have hadoop in your path OR export HADOOP_HOME=<hadoop-install-dir>
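Concretely, that requirement looks like this (the install path is an assumption):

```shell
# Make the Hadoop client visible to Hive ("/opt/hadoop" is a placeholder path).
export HADOOP_HOME=/opt/hadoop
export PATH="$HADOOP_HOME/bin:$PATH"
echo "HADOOP_HOME=$HADOOP_HOME"
```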
@iTech: That's right. If you install Hive, you have to set the HADOOP_HOME variable to your Hadoop installation path. But that's not the problem. As I said, I have worked with Hadoop and Hive in pseudo-distributed mode before.
The only problem is that I'm not sure where to run all the daemons in a 5-node cluster in fully distributed mode. I'm confused because I want to run several tools together (Hadoop, HBase, and Hive).
I hope someone has a good tip...
If you are planning to use the described cluster for testing purposes, it is OK to have all your master daemons on the same server. You can also move the SecondaryNameNode role to Node_1, since the SecondaryNameNode is not a backup server for the NameNode; it only takes periodic checkpoints of the NameNode's metadata. That frees up Node_2 to be another worker node in your cluster, or to host HiveServer2 and the metastore.
Hope this will help.
