In a hadoop cluster, should hive be installed on all nodes? - hadoop

I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes. Should I also install Hive on all the cluster nodes or only on the master node?

No, it is not something you install on worker nodes. Hive is a Hadoop client. Just run Hive according to the instructions you see at the Hive site.

From Cloudera's Hive installation Guide:
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.

Hive is basically used for processing structured and semi-structured data in Hadoop. We can also perform Analysis of large datasets which is present in HDFS and also in Amazon S3 filesystem using Hive. In order to query data hive also provides query language known as HiveQL which is similar to SQL. Using Hive one can easily run Ad-hoc queries for the data analysis. Using Hive we don’t need to write complex Map-Reduce jobs, we just need to submit SQL queries. Hive converts these SQL queries into MapReduce jobs.
Finally Hive SQL will get converted to MapReduce jobs and we don't have to submit MapReduce job from all node in a Hadoop cluster, in the same way we don't need Hive to be installed in all node of Hadoop cluster

Related

How to query kerberos enabled hbase using apache drill?

We have a kerberoized hadoop cluster, where HBase is running. Apache drill is running in distributed mode, in another cluster. Now, we need to query the Kerberos enabled HBase from the apache drill, using web UI. The Hadoop cluster is actually running in AWS, the HBase uses s3 as storage. Please help me with steps to achieve successful queries to HBase.
Apache_Drill_version:1.16.0 Hadoop version: 2
Usually, to query HBase, we run kinit with the keytab manually, then get into HBase shell in Hadoop cluster. We wanted to make use of drill, to query in a SQL fashion easily, better readability.

How is Hadoop different from database?

I was doing a case study on Spotify. I found out that Spotify uses Cassandra as a DB and also Hadoop. My question is, how is Hadoop different from a database. What type of files does Hadoop datanode stores? Why every corporation has DB and Hadoop as well. I know Hadoop is not a DB but what is it used for if there is DB cluster to save data?
Hadoop is not a database at all. Hadoop is a set of tools for distributed storage and processing, such as distributed filesystem (HDFS), MapReduce framework libraries, YARN resource manager.
Other tools like Hive, Spark, Pig, Giraph, sqoop, etc, etc can use Hadoop or it's components. For example Hive is a database. It uses HDFS for storing it's data and MapReduce framework primitives for building query execution graph.

In a hadoop cluster, should hive be installed on all nodes? Install Pig

I am new to Hadoop / Pig and I have just started reading the docs.
There are lots of blogs on installing Hadoop in cluster mode.
I know that Pig runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes.
Should I also install Pig on all the cluster nodes or only on the master node?
You would want to install Hive Metastore and Hive Server on 2 different nodes. By default, hive uses derby database, but most of the people choose to go with MySQL so there will be a MYSQL server daemon also.
So not to confuse you anymore :
Install HiveServer and WebHcat Server on one node
Install Hive Metastore and MySQL server on another node.
This is the best practice. If you have any other doubt you can ask!
I cannot tell if the question is about Hive or Pig, but there's a difference between clients and servers.
For Hive, the master services are the Metastore and HiveServer2. You can install these daemons on the same server to improve network traffic between the metastore and the Hive query compiler. You only need one client to communicate with those masters.
For Pig, it communicates directly to YARN and HDFS (optionally Hive, if you use Hcatalog). Again, it's only a client, so only one hosts needs it.
It is generally preferred to have a dedicated set of machines for Hive and the backing RDBMS for the metastore (Mysql or Postgres being the more popular options)
You also don't need to "install Pig in the cluster". For example, I could grab the Hadoop XML configs and run some Pig code against the YARN cluster from any outside computer after downloading Pig locally (same applies to Spark)

Is it possible to run Hadoop queries once Hive is installed?

I'm new to Hadoop, but I have been trying to create a single-node cluster for a college project. Just to set a context to my question, my projetct goal is to perform mapreduce jobs into the same data but while using different Hadoop-based software, these being Hive and Pig.
So, I wanted to know if, once I have a Hadoop running with Hive installed, how can I differ its commands? Since Hive is set, the node is its?
Yes, regular MapReduce jobs can still be run against the Hadoop cluster after Hive has been installed. Hive itself just converts the HiveQL queries into MapReduce jobs that it submits.

Consilidating multiple Hadoop clusters

We have multiple hadoop cluster using hive and pig, what is the best way to consolidate them in to one? In BI this was done by build EDW or MDM approach. how about hadoop is any one thinking about this
Pig and Hive are client side libraries, so there is nothing to migrate for them. Only the Pig and Hive client have to point to the appropriate Hadoop clusters.
Regarding the data, you could use DistCp to move the data across clusters.

Resources