Hive and User Defined Functions - hadoop

I have a question about UDFs in Hive.
When I use a UDF in a Hive query, does it process the data in a MapReduce manner? For instance, when I use the function avg, does Hive convert the function into MapReduce jobs?
Bests

In most cases a Hive query will be translated to a map/reduce job (the exceptions are things like SELECT * on an HBase table). Average (avg) is a built-in aggregate function and not a UDF, but Hive will process both in a map/reduce job.
Note that future versions of Hive will probably improve on this (see for example this post on the Hive Stinger initiative) but, as mentioned above, currently it is mostly m/r.
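To make the idea concrete, here is a minimal sketch of how an average can be computed in a map/reduce style: each map task reduces its input split to a partial (sum, count) pair, and the reduce side merges the partials and divides once at the end. The function names and the sample splits are hypothetical and are not Hive's actual API; this only illustrates the technique.

```python
from functools import reduce

def map_partial(rows):
    """Map side: reduce one input split to a (sum, count) pair."""
    total, count = 0.0, 0
    for value in rows:
        if value is not None:      # SQL AVG ignores NULLs
            total += value
            count += 1
    return total, count

def merge(a, b):
    """Combine two partial results; the order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1]

def finalize(partial):
    """Reduce side: turn the merged (sum, count) into the average."""
    total, count = partial
    return total / count if count else None

# Three hypothetical input splits of a single column.
splits = [[1, 2, 3], [4, None], [5, 6]]
partials = [map_partial(s) for s in splits]
average = finalize(reduce(merge, partials, (0.0, 0)))
```

Carrying (sum, count) instead of a per-split average is what keeps the result correct when splits have different sizes.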

Related

Relationship between Hive and Hadoop MapReduce?

Is there any Hive internal process that connects to reduce or map tasks?
Adding to that!
How does Hive work in relation with MapReduce?
How is the job getting scheduled?
How does the query result return to the hive driver?
In Hive there is no process that communicates with the Map/Reduce tasks directly. Hive communicates (flow 6.3) only with the JobTracker (Application Master in YARN), and only for job-processing matters once the job has been scheduled.
This image gives a clear picture of:
How does Hive use MapReduce as its execution engine?
How is the job getting scheduled?
How does the result return to the driver?
Edit: suggested by dennis-jaheruddin
Hive is typically controlled by means of HQL (Hive Query Language)
which is often conveniently abbreviated to Hive.
source

Hive always creates a MapReduce job

I've been using Hive at work. When I run a select like this
"Select * from TABLENAME"
Hive executes a MapReduce job, and when I run
"Select * from TABLENAME LIMIT X", regardless of the value of X,
Hive doesn't execute any MapReduce jobs.
I use Hive 1.2.1, HDP 2.3.0, Hue 2.6.1 and Hadoop 2.7.1.
Any ideas about this behavior?
Thanks!
Select * from table;
This requires neither a map nor a reduce phase. There is no filter (WHERE clause) or aggregation function here. This query simply reads from HDFS.
This is Hive's essential task: it is just an abstraction over map-reduce jobs. The engineers at Facebook had to write hundreds of map-reduce jobs for ad-hoc analysis, and map-reduce jobs are somewhat of a pain in the ass. So they abstracted them behind an SQL-like language that is translated into map-reduce jobs.
The same goes for Pig (Yahoo).
P.S. As far as I know, some queries are so simple that they aren't translated to map-reduce jobs but are executed locally on one node.

What is the difference between hbase and hive? (Hadoop)

From my understanding, HBase is the Hadoop database and Hive is the data warehouse.
Hive lets you create tables and store data in them; you can also map your existing HBase tables to Hive and operate on them.
Why should we use HBase if Hive does all that? Can we use Hive by itself?
I'm confused :(
So in simple terms: with Hive you can fire SQL-like queries (with some exceptions) at your tables, and it is used for batch operations. With HBase you can do real-time querying, and it is based on key-value pairs.
"Why should we use HBase if Hive does all that? Can we use Hive by itself?" Because Hive doesn't support updating your data set. So if you have a large analytical processing application, use Hive, and if you have real-time get/set/update request processing, use HBase.
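The two access patterns can be sketched roughly as follows. This is a toy illustration using plain Python structures, not the HBase or Hive APIs; the data, the table name page_views and the variable names are all made up for the example.

```python
# Hypothetical data: user id -> number of page views.
rows = [("u1", 3), ("u2", 5), ("u3", 2)]

# HBase-style access: a keyed store, where single rows are read and
# updated in place by their row key.
store = dict(rows)
store["u2"] = store["u2"] + 1      # real-time update of one row
one_row = store["u2"]              # real-time lookup of one row

# Hive-style access: a batch query scans every row to answer an
# analytical question, e.g. SELECT SUM(views) FROM page_views.
total_views = sum(views for _, views in rows)
```

The keyed lookup touches one row regardless of table size, while the batch query's cost grows with the amount of data scanned.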

What is the difference between Apache Pig and Apache Hive?

What is the exact difference between Pig and Hive? I found that both serve the same function because they are used for the same work. The only thing that differs is the implementation. So when should I use which technology? Is there any specification that clearly shows the difference between the two in terms of applicability and performance?
Apache Pig and Hive are two projects that layer on top of Hadoop, and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a bash or perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific and higher-level language, for querying data by running Hadoop jobs, rather than directly scripting step-by-step the operation of several MapReduce jobs on Hadoop. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running batch-oriented queries over massive data; it's not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data across data stored in storage systems mentioned above.
From a purely engineering point of view, I find PIG easier both to write and to maintain than SQL-like languages. It is procedural, so you apply a bunch of relations to your data one by one, and if something fails you can easily debug at intermediate steps; there is even a command called “illustrate” which uses an algorithm to sample some data matching your relation. I’d say for jobs with complex logic, this is definitely much more convenient than Hive, but for simple stuff the gain is probably minimal.
Regarding interfacing, I find that PIG offers a lot of flexibility compared to Hive. You don’t have a notion of a table in PIG, so you manipulate files directly, and you can define a loader to load pretty much any format very easily with loader UDFs, without having to go through a table-loading stage before you can do your transformations. Recent versions of PIG have a nice feature called dynamic invokers, which let you use pretty much any Java method directly in your PIG script without having to write a UDF.
For performance/optimization, from what I’ve seen you can directly control in PIG the type of join and grouping algorithm you want to use (I believe 3 or 4 different algorithms for each). I’ve personally never used this, but if you’re writing demanding algorithms it could be useful to decide what to do yourself instead of relying on the optimizer, as is the case in Hive. So I wouldn’t say it necessarily performs better than Hive, but in cases where the optimizer makes the wrong decision, you have the option to choose which algorithm to use and have more control over what happens.
One of the cool things I did lately was splits: you can split your execution flow and apply different relations to each split. So you can have a non-linear dataset, split it based on a field, and apply a different processing to each part, and maybe join the results together in the end, all this in the same script. I don’t think you can do this in Hive, you’d have to write different queries for each case, but I may be wrong.
One thing to note also is that you can increment counters in PIG. Currently you can only do this in PIG UDFs though. I don’t think you can use counters in Hive.
And there are some nice projects that allow you to interface PIG with Hive as well (like HCatalog), so you can basically read data from a Hive table, or write data to a Hive table (or both), by simply changing your loader in the script. Dynamic partitions are supported as well.
Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.
Pig Latin queries execute in a distributed fashion on a cluster. Our current implementation compiles Pig Latin programs into Map-Reduce jobs, and executes them using Hadoop cluster.
https://cwiki.apache.org/confluence/display/PIG/Index%3bjsessionid=F92DF7021837B3DD048BF9529A434FDA
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
https://cwiki.apache.org/Hive/
What is the exact difference between Pig and Hive? I found that both have same functional meaning because they are used for doing same work.
Have a look at the Pig vs Hive comparison in a nutshell from the dezyre article.
Hive scores over PIG in Partitions, Server, Web interface & JDBC/ODBC support.
Some differences:
Hive is best for structured data & PIG is best for semi-structured data
Hive is used for reporting & PIG for programming
Hive is used as declarative SQL & PIG as a procedural language
Hive supports partitions & PIG does not
Hive can start an optional Thrift-based server & PIG can't
Hive defines tables beforehand (schema) and stores the schema information in a database & PIG has no dedicated metadata database
Hive does not support Avro but PIG does
PIG also supports the additional COGROUP feature for performing outer joins but Hive does not. But both Hive & PIG can join, order & sort dynamically
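As a rough illustration of why partition support matters, the sketch below models a partitioned table as one directory per partition value (Hive lays partitions out as column=value directories) and shows a query with a partition filter skipping the directories that cannot match. The table contents, the dt column and the scan function are hypothetical and only stand in for what the query engine does.

```python
# Hypothetical table layout: one directory per partition value.
table = {
    "logs/dt=2015-01-01": [("a", 1), ("b", 2)],
    "logs/dt=2015-01-02": [("c", 3)],
    "logs/dt=2015-01-03": [("d", 4), ("e", 5)],
}

def scan(table, dt=None):
    """Return (rows, partitions_read); skip partitions that cannot match."""
    rows, read = [], 0
    for path, part_rows in sorted(table.items()):
        value = path.rsplit("dt=", 1)[1]
        if dt is not None and value != dt:
            continue               # pruned: this directory is never opened
        read += 1
        rows.extend(part_rows)
    return rows, read

# Comparable to: SELECT * FROM logs WHERE dt = '2015-01-02'
rows, read = scan(table, dt="2015-01-02")
```

Without the filter, the same call reads all three partitions.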
So when to use and which technology?
The differences above answer this question.
HIVE: Structured data, SQL-like queries, used for reporting purposes
PIG: Semi-structured data; programming a workflow involving a sequence of activities for Map Reduce jobs.
Regarding job performance, both HIVE and PIG are slow compared to a traditional Map Reduce job. Reason: in the end, Hive or PIG scripts have to be converted into a series of Map Reduce jobs.
Have a look at this related SE question:
Pig vs Hive vs Native Map Reduce
The main difference is that PIG is a data flow language and Hive is a data warehouse.
PIG can be used much like a step-by-step procedural language,
whereas HIVE is used as a declarative language.
PIG can be used for ingesting online streaming unstructured data, but HIVE can only access structured data; it can also access data from databases, both SQL (RDBMS) and NoSQL, by using JDBC and ODBC drivers.
PIG can convert data into Avro format but HIVE can't.
PIG can't create partitions but HIVE can.
HIVE is often placed after PIG in a pipeline, so that HIVE accesses the data once it has been processed by PIG.
Which one to use depends on the data: if you are working with structured, relational data, use HIVE; otherwise use PIG.
With PIG we can communicate with ETL tools, but it takes more time compared with HIVE. On the other hand, PIG is easier to get started with than HIVE, because in HIVE we have to create a table before processing the data.

About Hadoop's map-reduce

I am new to Hive and Hadoop.
I have implemented a task in Hive. For that I have written blocks of queries in Java, and I am accessing Hive using JDBC from Java (like a stored procedure in SQL).
Hive uses Hadoop's MapReduce for executing each and every query. So do I need to write a MapReduce job (Hadoop) for it in Java? (I have this doubt because, since Hive is using Hadoop MapReduce, is there any need to implement a Mapper and Reducer?)
No, Hive generates one or more map/reduce jobs from your SQL statements and sends them to the Hadoop cluster.
No, your Hive queries will be translated to map/reduce jobs.
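For intuition about what that generated work looks like, here is a small sketch of the three phases a query such as SELECT country, COUNT(*) FROM visits GROUP BY country would roughly go through. The table, the column and the function names are invented for illustration; this is a simplified model and not the plan Hive actually produces.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (key, 1) pair per input row."""
    for country in rows:
        yield country, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical contents of the country column.
rows = ["DE", "US", "DE", "FR", "US", "DE"]
counts = reduce_phase(shuffle(map_phase(rows)))
```

You write only the SQL statement; the mapper and reducer logic above is the kind of code Hive spares you from writing by hand.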
