Is it possible to run Hadoop queries once Hive is installed? - hadoop

I'm new to Hadoop, but I have been trying to create a single-node cluster for a college project. Just to set a context to my question, my projetct goal is to perform mapreduce jobs into the same data but while using different Hadoop-based software, these being Hive and Pig.
So, I wanted to know if, once I have a Hadoop running with Hive installed, how can I differ its commands? Since Hive is set, the node is its?

Yes, regular MapReduce jobs can still be run against the Hadoop cluster after Hive has been installed. Hive itself just converts the HiveQL queries into MapReduce jobs that it submits.

Related

Purpose of using HBase in Hadoop instead of Hive [duplicate]

This question already has an answer here:
What is the difference between hbase and hive? (Hadoop)
(1 answer)
Closed 5 years ago.
In my project, we are using Hadoop 2, Spark, Scala. Scala is the programming language and Spark is using here for analysing. we are using Hive and HBase both. I can access all details like file etc. of HDFS using Hive.
But my confusions are -
When I can able to performed all jobs using Hive, Then why HBase is required to store the data. Is it not an overhead?
What are the functionality of HIVE and HBase?
If we only used Hive, Then what should be the problem?
Can anyone please let me know.
When I can able to performed all jobs using Hive, Then why HBASE is required to store the data. Is it not a overhead?
What are the functionality of Hive and Hbase
HBase is No Sql database which stores the data in key value pair. Hive has integration with Hbase.Hbase HIve Integration
Advantage :- Hive queries over HBase. Think joins and a easy way to do aggregates and simple operations on your Hbase data.
Hbase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses Hbase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the benefit of both worlds.
If we only used Hive, Then what should be the problem?
If we will use Hive There is no problem . But in project there so many scenarios we have to consider .
Performance
Storage
Stability of used technology
Compatibility (Hive ware house is easily accessible for most of the Tools in Hadoop)
When I can able to performed all jobs using Hive, Then why HBase is
required to store the data. Is it not an overhead?
I can't say it's overhead or not. But HBase responds to requests in real-time as its database when it comes to Hive it runs jobs on MapReduce/Spark/Tez engines.
What are the functionality of Hive and HBase?
Hive:
It's a SQL-like language that gets translated into MapReduce/Spark/Tez jobs. it only runs batch processes on Hadoop. for more check this how Hive queries run on MapReduce engine
HBase:
It's key/value store database which runs on top of HDFS/S3(on AWS). It does real-time operations for requests.
If we only used Hive, Then what should be the problem?
As discussed If the query needs to process in real-time then HBase is the choice over Hive.

Does Hive depend on/require Hadoop?

Hive installation guide says that Hive can be applied to RDBMS, my question is, sounds like Hive can exist without Hadoop, right? It's an independent HQL engineer that could work with any data source?
You can run Hive in local mode to use it without Hadoop for debugging purposes. See below url
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode
Hive provided JDBC driver to query hive like JDBC, however if you are planning to run Hive queries on production system, you need Hadoop infrastructure to be available. Hive queries eventually converts into map-reduce jobs and HDFS is used as data storage for Hive tables.

Hadoop and Cassandra integration - substituting HDFS with Cassandra

I want to efficiently develop a Hadoop job using Cassandra as input and output.
As I know MapReduce jobs in Hadoop uses HDFS to store intermediate results.
Is it possible to make Hadoop store intermediate results in Cassandra File System? If yes then how to achieve that?
I wonder if it possible to completely disable HDFS if I am using Hadoop only Cassandra as underlying data storage system.
I am using Cassandra 2.0.11 and Hadoop 1.0.4 (If the above is possible only in Hadoop 2.x I would also apreciate that information)

Consilidating multiple Hadoop clusters

We have multiple hadoop cluster using hive and pig, what is the best way to consolidate them in to one? In BI this was done by build EDW or MDM approach. how about hadoop is any one thinking about this
Pig and Hive are client side libraries, so there is nothing to migrate for them. Only the Pig and Hive client have to point to the appropriate Hadoop clusters.
Regarding the data, you could use DistCp to move the data across clusters.

In a hadoop cluster, should hive be installed on all nodes?

I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes. Should I also install Hive on all the cluster nodes or only on the master node?
No, it is not something you install on worker nodes. Hive is a Hadoop client. Just run Hive according to the instructions you see at the Hive site.
From Cloudera's Hive installation Guide:
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.
Hive is basically used for processing structured and semi-structured data in Hadoop. We can also perform Analysis of large datasets which is present in HDFS and also in Amazon S3 filesystem using Hive. In order to query data hive also provides query language known as HiveQL which is similar to SQL. Using Hive one can easily run Ad-hoc queries for the data analysis. Using Hive we don’t need to write complex Map-Reduce jobs, we just need to submit SQL queries. Hive converts these SQL queries into MapReduce jobs.
Finally Hive SQL will get converted to MapReduce jobs and we don't have to submit MapReduce job from all node in a Hadoop cluster, in the same way we don't need Hive to be installed in all node of Hadoop cluster

Resources