Relationship between Hive and Hadoop MapReduce?

Is there any Hive internal process that connects to the map or reduce tasks?
Adding to that:
How does Hive work in relation to MapReduce?
How does the job get scheduled?
How does the query result return to the Hive driver?

Hive has no process that communicates with map/reduce tasks directly. Once a job has been scheduled, Hive communicates only with the JobTracker (the ApplicationMaster in YARN) for job-processing matters (flow 6.3 in the Hive architecture diagram that accompanied the original answer). That diagram gives a clear understanding of:
how Hive uses MapReduce as its execution engine,
how the job gets scheduled, and
how the result returns to the driver.
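You can watch this translation from the Hive shell with EXPLAIN, which prints the stages Hive will submit without running them. A minimal sketch, assuming a hypothetical table named sales:
EXPLAIN
SELECT category, COUNT(*)
FROM sales
GROUP BY category;
-- With the MapReduce engine, the plan lists stages such as "Stage-1 is
-- a root stage", each with a Map Operator Tree and a Reduce Operator
-- Tree: Hive compiles the query into a job and hands it to the
-- JobTracker/ApplicationMaster rather than driving individual tasks.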
Edit (suggested by dennis-jaheruddin): Hive is typically controlled by means of HQL (Hive Query Language), which is often conveniently abbreviated to Hive. (source)

Related

Impala vs Hive. How does Impala circumvent MapReduce?

How is Impala able to achieve lower latency than Hive in query processing?
I was going through http://impala.apache.org/overview.html, where it is stated:
To avoid latency, Impala circumvents MapReduce to directly access the
data through a specialized distributed query engine that is very
similar to those found in commercial parallel RDBMSs. The result is
order-of-magnitude faster performance than Hive, depending on the type
of query and configuration.
How does Impala fetch the data without MapReduce (as is done in Hive)?
Can we say that Impala is closer to HBase and should be compared with HBase instead of with Hive?
Edit: Or can we say that, classically, Hive runs on top of MapReduce and requires less memory to work with, while Impala does everything in memory (and hence requires more), with the data already cached in memory and acted on upon request?
Just read Impala Architecture and Components
Impala is a massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts.... Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries.
It circumvents MapReduce containers by having a long-running daemon on every node that is able to accept query requests. There is no single point of failure that funnels requests, like HiveServer2; all Impala engines are able to respond to query requests immediately rather than queueing up MapReduce YARN containers.
Impala does, however, rely on the Hive Metastore service, because it is a useful service for mapping metadata stored in the RDBMS to the Hadoop filesystem. Pig, Spark, PrestoDB, and other query engines also share the Hive Metastore without communicating through HiveServer.
Data is not "already cached" in Impala. Similar to Spark, you must read the data into a large portion of memory in order for operations to be quick. Unlike Spark, the daemons and statestore services remain active for handling subsequent queries.
Impala can query HBase, but the two are not similar in architecture, and in my experience a well-designed HBase table is faster to query than Impala. Impala is probably closer to Kudu.
Also worth mentioning: it's not really recommended to run Hive on MapReduce anymore. Tez is far better, and Hortonworks states that Hive LLAP is better than Impala, although, as you quoted, it largely "depends on the type of query and configuration."
Impala uses the "Impala Daemon" service to read data directly from the DataNode (it must be installed on the same hosts as the DataNodes). It caches only the locations of files and some statistics in memory, not the data itself.
That is why Impala cannot see new files created within a table: you must run INVALIDATE METADATA or REFRESH (depending on your case) to tell Impala to cache the new files' metadata and be able to read them directly.
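For reference, the two statements differ in cost and scope; a minimal sketch, assuming a hypothetical table named sales:
-- After appending data files to an existing table (e.g. via Hive or
-- hdfs dfs -put), reload just that table's file metadata (cheap):
REFRESH sales;
-- After creating a table outside Impala, or when the metadata is
-- broadly stale, mark it invalid so it is reloaded on next access
-- (more expensive):
INVALIDATE METADATA sales;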
Since Impala works in memory, you need enough memory for the data read by the query. If your query will use more data than you have memory (a complex query with aggregations on huge tables), use Hive with the Spark engine rather than the default MapReduce:
set hive.execution.engine=spark;
just before the query. You can run the same query in Hive with the Spark engine.
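For example (hypothetical table and column names, and assuming Hive on Spark is configured in your cluster), the engine is switched per session just before the query:
SET hive.execution.engine=spark;   -- valid values are mr, tez, spark
SELECT customer, SUM(amount) AS total
FROM big_sales
GROUP BY customer;
SET hive.execution.engine=mr;      -- optionally switch back afterwards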
Impala is a Cloudera product; you won't find it on Hortonworks or MapR (or other distributions). Tez, for example, is not included with Cloudera. It all depends on the platform you are using.

What is the use of Hive's LLAP when there is Hive TEZ?

In our project, we load data from a Greenplum database into HDFS (Hive). Lately, I came to know that there is a new bundle with Hive 2 called 'LLAP'. I am confused about the concept of LLAP.
What is the exact use of LLAP? When we already have Hive's Tez engine, what is the use of LLAP? A developer in our project told me that we are using Hive LLAP to load the data into HDFS Hive tables. Is it good practice to use LLAP? If not, why not?
Could anyone give me some clarity on the above queries?
https://cwiki.apache.org/confluence/display/Hive/LLAP is a good place to learn about Hive Live Long And Process (LLAP).
As the link says
LLAP works within existing, process-based Hive execution to preserve the scalability and versatility of Hive. It does not replace the existing execution model but rather enhances it.
and
LLAP is not an execution engine (like MapReduce or Tez)
Rather, it provides a long-lived daemon (hence the LL part of the acronym) to replace interactions with the DataNode, and this daemon also provides caching, pre-fetching, and some query processing. This allows simple queries to be largely processed by the daemon itself, with more complex queries being performed in YARN containers as usual.
The link also shows how Tez AM can sit above all of this, and submit Hive tasks which operate via LLAP, which interacts with the DataNode as required. In the example, initial stages of the query are pushed into LLAP, but large shuffles are performed in separate containers.
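If you want to experiment, here is a minimal sketch of the session-level settings involved (property names from Hive 2.x; defaults and availability vary by distribution):
SET hive.execution.engine=tez;      -- LLAP enhances Tez; it is not an engine itself
SET hive.execution.mode=llap;       -- run work in LLAP daemons instead of plain containers
SET hive.llap.execution.mode=all;   -- other values include map, only, and auto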
LLAP nodes are an additional layer of nodes (one LLAP daemon per Hadoop DataNode) between Tez and the DataNode that can cache data and process parts of queries. Query execution is still scheduled and managed by Tez.
LLAP daemons cache data, which can accelerate queries when common data is accessed again and again.
In short, it boosts performance; you will get very good performance for your queries using LLAP in Hive.
Hive can also work without LLAP, but it can be slower.

Hive always creates a MapReduce job

I've been using Hive at my work. When I run a select like
Select * from TABLENAME
Hive executes a MapReduce job, but when I run
Select * from TABLENAME LIMIT X
Hive doesn't execute MapReduce jobs, independently of X.
I use Hive 1.2.1, HDP 2.3.0, Hue 2.6.1 and Hadoop 2.7.1.
Any ideas about this fact?
Thanks!
Select * from table;
requires no map nor reduce phase. There is no filter (WHERE statement) or aggregation function here; the query simply reads from HDFS.
This is Hive's essential task: it is just an abstraction over map/reduce jobs. The former Facebook engineers had to write hundreds of map/reduce jobs for ad-hoc analysis, and map/reduce jobs are somewhat of a pain to write, so they abstracted them with an SQL-like language that is translated into map/reduce jobs.
The same is true of Pig (Yahoo).
P.S. Some queries are so easy that they aren't translated into map/reduce jobs but are executed locally on one node, as far as I know.
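The behavior described in the question is governed by Hive's fetch-task optimization, which lets trivial queries bypass MapReduce entirely. A minimal sketch (property values from Hive 1.x; the table name is hypothetical):
-- 'more' lets simple SELECT/filter/LIMIT queries run as a local fetch
-- task with no MapReduce job; 'minimal' restricts this to SELECT *,
-- partition-column filters and LIMIT; 'none' always launches a job.
SET hive.fetch.task.conversion=more;
SELECT * FROM tablename LIMIT 10;   -- served by a fetch task, no MR job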

Native mapreduce VS hbase mapreduce

If I create an MR job using TableMapReduceUtil (HBase), it seems that the HBase scanner feeds data into the mapper, and the data from the reducer is converted to the specific HBase output format to store it in an HBase table.
For this reason, I expect an HBase mapreduce job to take more time than a native MR job.
So, how much longer does an HBase job take than a native MR job?
In regard to reads, going through HBase can be 2-3 times slower than native map/reduce that uses files directly.
In the recently announced HBase 0.98 they've added the capability to do map/reduce over HBase snapshots. You can see this presentation for details (slide 7 for API, slide 16 for speed comparison).
In regard to writes, you can write HFiles directly and then bulk load them into HBase; however, since HBase caches data and does bulk writes, you can also tune it and get comparable or better results.

About Hadoop's map-reduce

I am new to Hive and Hadoop.
I have implemented a task in Hive. For that, I have written blocks of queries in Java, and I access Hive using JDBC from Java (like a stored procedure in SQL).
Hive uses Hadoop's MapReduce for executing each and every query, so do I need to write a MapReduce (Hadoop) job for it in Java? (I have this doubt because, if Hive is using Hadoop MapReduce, is there any need to implement a Mapper and Reducer?)
No. Hive generates one or more map/reduce jobs from your SQL statements and sends them to the Hadoop cluster.
No, your Hive queries will be translated into map/reduce jobs.
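To illustrate with a hypothetical word-count table: the single HiveQL statement below replaces a hand-written Mapper and Reducer pair, since Hive compiles the GROUP BY into map and reduce phases for you.
-- One HQL statement, no Mapper or Reducer class needed: Hive's planner
-- turns the GROUP BY into a map phase (emit each word) and a reduce
-- phase (count per word), then submits the job to the cluster.
SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word;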
