In a hadoop cluster, should hive be installed on all nodes? Install Pig - hadoop

I am new to Hadoop / Pig and I have just started reading the docs.
There are lots of blogs on installing Hadoop in cluster mode.
I know that Pig runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes.
Should I also install Pig on all the cluster nodes or only on the master node?

You would want to install the Hive Metastore and HiveServer on two different nodes. By default, Hive uses an embedded Derby database, but most people choose MySQL instead, so there will also be a MySQL server daemon.
To summarize:
Install HiveServer and the WebHCat server on one node.
Install the Hive Metastore and the MySQL server on another node.
This is the best practice. If you have any other questions, feel free to ask!

I cannot tell if the question is about Hive or Pig, but there's a difference between clients and servers.
For Hive, the master services are the Metastore and HiveServer2. You can install these daemons on the same server to reduce network traffic between the metastore and the Hive query compiler. You only need one client to communicate with those masters.
Pig communicates directly with YARN and HDFS (and optionally Hive, if you use HCatalog). Again, it's only a client, so only one host needs it.
It is generally preferred to have a dedicated set of machines for Hive and the backing RDBMS for the metastore (MySQL and Postgres being the more popular options).
You also don't need to "install Pig in the cluster". For example, I could grab the Hadoop XML configs and run some Pig code against the YARN cluster from any outside computer after downloading Pig locally (the same applies to Spark).
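As a rough illustration of the "Pig is only a client" point, the sketch below prepares a Pig launch from a machine outside the cluster. It assumes the cluster's XML configs were copied to a local directory and that the `pig` launcher picks them up through the `HADOOP_CONF_DIR` environment variable; the script name and directory path are hypothetical placeholders.

```python
import os
import shlex


def build_pig_invocation(script_path, hadoop_conf_dir, base_env=None):
    """Return (argv, env) for running a Pig script against a remote YARN cluster.

    Nothing is executed here; the caller decides when to launch the command.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the local Pig/Hadoop client reads the copied cluster
    # configs (core-site.xml, yarn-site.xml, ...) from this directory.
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    # "-x mapreduce" selects cluster execution rather than local mode.
    argv = ["pig", "-x", "mapreduce", script_path]
    return argv, env


if __name__ == "__main__":
    # Hypothetical script name and config location.
    argv, env = build_pig_invocation("wordcount.pig", "/opt/cluster-conf")
    print("HADOOP_CONF_DIR=" + env["HADOOP_CONF_DIR"], shlex.join(argv))
    # To actually run it: subprocess.run(argv, env=env, check=True)
```

Only the machine that runs this command needs Pig installed; the cluster nodes just execute the jobs that Pig submits.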

Related

How to query kerberos enabled hbase using apache drill?

We have a Kerberized Hadoop cluster where HBase is running. Apache Drill is running in distributed mode in another cluster. We need to query the Kerberos-enabled HBase from Apache Drill using the web UI. The Hadoop cluster runs in AWS, and HBase uses S3 as storage. Please help me with the steps needed to query HBase successfully.
Apache Drill version: 1.16.0, Hadoop version: 2
Usually, to query HBase, we run kinit with the keytab manually and then open the HBase shell in the Hadoop cluster. We want to use Drill so that we can query in SQL, which is easier and more readable.

hadoop and its technologies setup

For a study project, I have selected the following technologies because the data source is SQL Server.
Initial data size is 100 GB, growing by 10 per quarter.
Information
Hadoop – multi-node cluster (1 NameNode + 3 DataNodes)
Hadoop 3.1.2
Apache Maven 3.6.0
Ubuntu 18.04
Ambari
The above setup is ready. The following items remain:
Sqoop: 1.4.7
Hive: 2.3.5
Oozie 5.0.0
Should they all be installed on separate machines?
What is the deployment strategy once development is completed?
If you have the hardware available, then yes, every master service should be on a separate machine for fault tolerance.
That means the Oozie server, HiveServer, and Hive Metastore are all separate.
Sqoop and the Hive client are only clients and can be on any NodeManager host.
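The placement rule above can be checked mechanically. The following sketch assumes a simple service-to-host mapping; the service labels and host names are hypothetical and do not come from Ambari or any other tool.

```python
from collections import defaultdict

# Master services named in the answer; the labels are illustrative only.
MASTER_SERVICES = {"oozie-server", "hiveserver2", "hive-metastore"}


def shared_master_hosts(layout):
    """Return {host: [services]} for hosts that run more than one master service."""
    by_host = defaultdict(list)
    for service, host in layout.items():
        if service in MASTER_SERVICES:
            by_host[host].append(service)
    return {host: sorted(svcs) for host, svcs in by_host.items() if len(svcs) > 1}


# Hypothetical layout: each master on its own machine, clients on a worker.
layout = {
    "oozie-server": "master1",
    "hiveserver2": "master2",
    "hive-metastore": "master3",
    "sqoop-client": "worker1",
    "hive-client": "worker1",
}

# An empty result means no two master services share a machine.
print(shared_master_hosts(layout))
```

Clients such as Sqoop and the Hive client are deliberately ignored by the check, since co-locating them on a worker does not affect fault tolerance of the master services.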

Set up HiveServer2 and Hive Metastore on separate nodes

Is it possible to set up the Hive Metastore and HiveServer2 services on separate nodes? I know that HDP Ambari forces you to set up the two on the same node, along with WebHCat, I believe, but what about other vendors such as Cloudera?
HiveServer and the Hive Metastore server are independent daemons that can run on different nodes. A Thrift-based connection is used for communication between them. The Cloudera distribution and MapR offer this option, and I think Hortonworks should as well.
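To make the Thrift connection concrete, here is a minimal sketch that generates the `hive-site.xml` fragment a HiveServer2 host would use to reach a remote metastore. `hive.metastore.uris` is the standard Hive property for this and 9083 is the usual default metastore port; the host name is a hypothetical placeholder, and a real `hive-site.xml` would contain other properties too.

```python
import xml.etree.ElementTree as ET


def metastore_client_config(metastore_host, port=9083):
    """Build a hive-site.xml fragment that points at a remote Hive Metastore."""
    root = ET.Element("configuration")
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    # The metastore is reached over Thrift, hence the thrift:// scheme.
    ET.SubElement(prop, "value").text = f"thrift://{metastore_host}:{port}"
    return ET.tostring(root, encoding="unicode")


# Hypothetical metastore host name.
print(metastore_client_config("metastore-host.example.com"))
```

With this property set on the HiveServer2 node, the two daemons no longer need to share a machine.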

Hive Server doesn't see old hdfs tables

I'm having a problem with the Hive server that I don't understand. I've just set up a Hadoop cluster and want to access it from a Hive service. My first attempt was to run the Hive server on one of the cluster machines.
Everything worked nicely, but I wanted to move the Hive service to another machine outside the Hadoop cluster.
So I started a new machine outside the Hadoop cluster, installed Hive (plus the Hadoop libraries), and copied the Hadoop config from the cluster. When I run the Hive server, almost everything works. I can connect with the Hive CLI from a different machine to my Hive server, create new tables in the Hive warehouse within HDFS on the Hadoop cluster, query them, and so on.
What I don't understand is that the Hive server does not seem to recognize the old tables that were created during my first attempt.
Some notes about my config: all tables are managed by Hive and stored in HDFS, and the Hive configuration is the default one. I suppose it has to do with my Hive metastore, but I couldn't say what.
Thank you!!

In a hadoop cluster, should hive be installed on all nodes?

I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes. Should I also install Hive on all the cluster nodes or only on the master node?
No, it is not something you install on worker nodes. Hive is a Hadoop client. Just run Hive according to the instructions you see at the Hive site.
From Cloudera's Hive installation Guide:
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.
Hive is used for processing structured and semi-structured data in Hadoop. With Hive we can also analyze large datasets stored in HDFS or in the Amazon S3 filesystem. To query data, Hive provides a query language called HiveQL, which is similar to SQL. Hive makes it easy to run ad hoc queries for data analysis: we don't need to write complex MapReduce jobs, we just submit SQL queries, and Hive converts them into MapReduce jobs.
Since Hive SQL is ultimately converted to MapReduce jobs, and we don't have to submit a MapReduce job from every node in a Hadoop cluster, we likewise don't need Hive installed on every node of the cluster.
