I am new to Hive and Hadoop.
I have implemented a task in Hive. For that, I have written blocks of queries in Java, and I am accessing Hive using JDBC from Java (like a stored procedure in SQL).
Hive uses Hadoop's MapReduce for executing each and every query. So do I need to write a MapReduce job (in Java) for it? (I have this doubt because, since Hive is already using Hadoop MapReduce, is there any need to implement a Mapper and Reducer myself?)
No. Hive generates one or more map/reduce jobs from your SQL statements and sends them to the Hadoop cluster.
No, your Hive queries will be translated into map/reduce jobs.
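For illustration, a minimal sketch of running a HiveQL statement from Java over JDBC; the host, port, credentials, and table name are placeholders, but the driver class and the jdbc:hive2 URL scheme are the standard HiveServer2 ones. You write only SQL; Hive compiles it to map/reduce itself:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver; host/port/db/user are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Plain HiveQL: Hive compiles this into map/reduce jobs itself,
             // so no Mapper or Reducer class is written here.
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM my_table GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```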
Related
Is there any Hive internal process that connects to the map or reduce tasks?
Adding to that!
How does Hive work in relation to MapReduce?
How is the job getting scheduled?
How does the query result return to the hive driver?
For Hive there is no process that communicates with the map/reduce tasks directly. It communicates (flow 6.3 in the image) only with the JobTracker (the Application Master in YARN) for job-processing-related matters once the job has been scheduled.
This image will give a clear understanding of:
How does Hive use MapReduce as its execution engine?
How is the job getting scheduled?
How does the result return to the driver?
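A hands-on way to see this translation is Hive's EXPLAIN statement, which prints the map/reduce stages the compiler plans for a query. A sketch, assuming the JDBC connection from the earlier example (the table name is a placeholder):

```java
// Assumes the Connection "conn" from the earlier JDBC sketch.
try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(
         "EXPLAIN SELECT category, COUNT(*) FROM my_table GROUP BY category")) {
    // Each row is one line of the query plan; an aggregation like this
    // shows up as a stage with Map and Reduce operator trees.
    while (rs.next()) {
        System.out.println(rs.getString(1));
    }
}
```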
Edit: suggested by dennis-jaheruddin
Hive is typically controlled by means of HQL (Hive Query Language), which is often conveniently abbreviated to Hive.
source
I've been using Hive at my work. When I run a select like this
"SELECT * FROM TABLENAME"
Hive executes a MapReduce job, but when I run
"SELECT * FROM TABLENAME LIMIT X" (independently of X),
Hive doesn't execute a MapReduce job.
I use Hive 1.2.1, HDP 2.3.0, Hue 2.6.1, and Hadoop 2.7.1.
Any ideas about this?
Thanks!
SELECT * FROM table;
requires neither a map nor a reduce phase. There is no filter (WHERE clause) or aggregation function here; this query simply reads from HDFS.
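As far as I know, what decides this is Hive's fetch-task conversion setting (hive.fetch.task.conversion): simple queries that pass its test are served as a direct fetch from HDFS instead of a job. A sketch of toggling it, again over the assumed JDBC connection:

```java
// Sketch: toggling fetch-task conversion on an existing Hive JDBC connection.
// The documented values are "none", "minimal" and "more"; "more" has been
// the default since Hive 0.14 and lets simple SELECT/filter/LIMIT queries
// skip MapReduce entirely.
try (java.sql.Statement stmt = conn.createStatement()) {
    stmt.execute("SET hive.fetch.task.conversion=none");
    // Now even "SELECT * FROM t LIMIT 10" launches a MapReduce job.
    stmt.execute("SET hive.fetch.task.conversion=more");
    // Now it is answered by a plain HDFS read, with no job at all.
}
```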
This is Hive's essential task: it is just an abstraction over map/reduce jobs. The Facebook engineers who created it had to write hundreds of map/reduce jobs for ad-hoc analysis, and writing map/reduce jobs by hand is a pain, so they abstracted it behind an SQL-like language that is translated into map/reduce jobs.
Pig (from Yahoo) is the same idea.
P.S. Some queries are so simple that they aren't translated into map/reduce jobs at all but are executed locally on a single node, as far as I know.
From my understanding, HBase is the Hadoop database and Hive is the data warehouse.
Hive lets you create tables and store data in them; you can also map your existing HBase tables to Hive and operate on them.
Why should we use HBase if Hive does all that? Can we use Hive by itself?
I'm confused :(
So in simple terms: with Hive you can fire SQL-like queries (with some exceptions) at your tables, and it is used for batch operations. With HBase you can do real-time querying, and it is based on key-value pairs.
"Why should we use HBase if Hive does all that? Can we use Hive by itself?" Because Hive doesn't support updating your data set. So if you have a large analytical-processing application, use Hive; if you have real-time get/set/update request processing, use HBase.
If I create an MR job using TableMapReduceUtil (HBase), it seems that the HBase scanner feeds data into the mapper, and the data from the reducer is converted into the HBase output format to store it in an HBase table.
For this reason, I expect an HBase MapReduce job to take more time than a native MR job.
So, how much longer does an HBase job take than a native MR job?
Regarding reads: going through HBase can be 2-3 times slower than native map/reduce that uses the files directly.
In the recently announced HBase 0.98 they've added the capability to run map/reduce over HBase snapshots. You can see this presentation for details (slide 7 for the API, slide 16 for a speed comparison).
Regarding writes: you can write HFiles directly and then bulk-load them into HBase. However, since HBase caches data and does bulk writes, you can also tune it and get comparable or better results.
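A rough sketch of that bulk-load path with the HBase 1.x MapReduce helpers (HFileOutputFormat2 and LoadIncrementalHFiles are the standard helpers; the table name, paths, and the toy CSV mapper are placeholders): a job writes region-aligned HFiles, which are then moved into the table, bypassing the normal write path.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

    // Hypothetical mapper: parses "rowKey,value" CSV lines into Puts for column info:v.
    public static class CsvToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName name = TableName.valueOf("my_table"); // placeholder table
            Job job = Job.getInstance(conf, "hfile-writer");
            job.setJarByClass(BulkLoadSketch.class);
            job.setMapperClass(CsvToPutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path("/input"));           // placeholder
            FileOutputFormat.setOutputPath(job, new Path("/staged-hfiles")); // placeholder
            // Sorts and partitions the map output so each HFile matches one region.
            HFileOutputFormat2.configureIncrementalLoad(
                    job, conn.getTable(name), conn.getRegionLocator(name));
            if (job.waitForCompletion(true)) {
                // Moves the finished HFiles into the region servers,
                // skipping the normal write path (WAL + memstore).
                new LoadIncrementalHFiles(conf).doBulkLoad(
                        new Path("/staged-hfiles"), conn.getAdmin(),
                        conn.getTable(name), conn.getRegionLocator(name));
            }
        }
    }
}
```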
I have a question about UDFs in Hive.
When I use a UDF in a Hive query, does it process the data in a MapReduce manner? For instance, when I use the function avg, does Hive convert the function into MapReduce jobs?
Best
In most cases a Hive query will be translated into a map/reduce job (the exceptions are things like SELECT * on an HBase table). Average (avg) is a built-in aggregate function and not a UDF, but Hive will process both inside a map/reduce job.
Note that future versions of Hive will probably improve on this (see, for example, this post on the Hive Stinger initiative), but as mentioned above, currently it is mostly map/reduce.
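For reference, a minimal custom UDF sketch using the classic org.apache.hadoop.hive.ql.exec.UDF base class (the class and function names here are made up; avg itself is a built-in UDAF and is not written this way). The evaluate method runs row by row inside the map or reduce tasks that Hive generates:

```java
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical example UDF: lower-cases a string.
// Hive calls evaluate() once per row inside the generated map/reduce tasks.
@Description(name = "my_lower", value = "_FUNC_(str) - lowercases str")
public final class MyLowerUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}
```

It would then be packaged into a JAR and registered in a session with ADD JAR and CREATE TEMPORARY FUNCTION before being used in a query.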