Performance comparison: Hive & MapReduce - Hadoop

Hive provides an abstraction layer over Java MapReduce jobs, so it should have performance issues compared to hand-written Java MapReduce jobs.
Is there any benchmark comparing the performance of Hive queries and Java MapReduce jobs?
Actual use-case scenarios with runtime data would be a real help.
Thanks

Your premise that "it should have performance issues compared to Java MapReduce jobs" is wrong.
Hive (and Pig, Crunch, and other MapReduce abstractions) would be slower than a fully tuned, hand-written MapReduce job.
However, unless you're experienced with Hadoop and MapReduce, the chances are that the MapReduce code you'd write for non-trivial queries would be slower than what Hive et al. produce.

I did some small tests in a VM some time back and I couldn't really notice any difference. Maybe Hive was a few seconds slower sometimes, but I can't really tell whether that was Hive's performance or my VM hanging due to low memory. One thing to keep in mind is that Hive will always try to determine the fastest way to run a MapReduce job. When you write small MapReduce jobs you'll probably be able to find the fastest way yourself, but with large, complex jobs (with joins, etc.) will you always be able to compete with Hive?
Also, the time needed to write a MapReduce job with multiple classes and methods feels like ages compared with writing a HiveQL query.
On the other hand, when I wrote the job myself I had a better feel for what was actually going on.
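To make that contrast concrete, here is a hedged, hypothetical example (the "users" table, its file layout, and the column positions are invented): the one-line HiveQL query shown in the comment and the hand-written MapReduce job below both compute a row count per country.

    // HiveQL version (one line):
    //   SELECT country, COUNT(*) FROM users GROUP BY country;
    //
    // Hand-written MapReduce version of the same query:
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CountryCount {

      public static class CountryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text country = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          // Assumes tab-separated rows with the country code in the second column.
          String[] fields = value.toString().split("\t");
          if (fields.length > 1) {
            country.set(fields[1]);
            ctx.write(country, ONE);
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "country count");
        job.setJarByClass(CountryCount.class);
        job.setMapperClass(CountryMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Two mapper/reducer classes, a driver, a build, and a jar, all to express one GROUP BY.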

If you have a small dataset on your machine and want to process it using Apache Hive, the job will run more slowly than processing the same dataset with Hadoop MapReduce; Hive's performance degrades slightly on small datasets. For large datasets, on the other hand, Apache Hive's performance would be better than MapReduce's.
While processing datasets in MapReduce, the data is simply stored in HDFS. MapReduce has no database of its own, whereas Hive has a metastore. Through Hive's metastore, the data can be shared with Impala, Beeline, and JDBC and ODBC drivers.
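As a small illustration of that sharing, here is a hedged sketch of querying a Hive table from plain Java over JDBC. The host, port, credentials, and the "users" table are made up; the driver class and the jdbc:hive2 URL scheme are the standard HiveServer2 ones.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT country, COUNT(*) AS cnt FROM users GROUP BY country")) {
          while (rs.next()) {
            System.out.println(rs.getString("country") + "\t" + rs.getLong("cnt"));
          }
        }
      }
    }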

Related

MapReduce real life uses

I am wondering in which cases MapReduce is chosen over Hive or Pig.
I know that it is used when:
we need in-depth filtering of the input data,
we are working with unstructured data,
we are working with graphs, ...
But is there any situation where we can't use Hive or Pig, or where MapReduce works much better and is heavily used in real projects?
Hive and Pig are generic solutions and they incur some overhead while processing the data. In most scenarios this is negligible, but in some cases it can be considerable.
If there are many tables that need to be joined, Hive and Pig apply a generic plan; if you write the MapReduce yourself after understanding the data, you can come up with a more optimal solution (for example a map-side join, as sketched below).
However, MapReduce should be treated as the kernel. If your solution can be reused elsewhere, it is better to develop it using MapReduce and integrate it with Hive/Pig/Sqoop.
Pig can be used to process unstructured data. It gives more flexibility than Hive while processing the data.
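For instance, here is a hedged sketch (file names and record layouts are invented) of the kind of data-aware shortcut a hand-written job can take: a map-side (replicated) join, where each mapper loads a small lookup table into memory from the distributed cache and joins without shuffling that table at all.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> countryNames = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
        // The driver is assumed to have registered the small lookup file with
        // job.addCacheFile(new URI("/dims/countries.txt#countries.txt")), so it
        // is available here under the symlink "countries.txt".
        try (BufferedReader reader = new BufferedReader(new FileReader("countries.txt"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t");   // countryCode \t countryName
            if (parts.length == 2) {
              countryNames.put(parts[0], parts[1]);
            }
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");  // userId \t countryCode \t ...
        if (fields.length < 2) {
          return;
        }
        // The join happens in the mapper; only the big table flows through the job.
        String name = countryNames.getOrDefault(fields[1], "UNKNOWN");
        context.write(new Text(fields[0]), new Text(name));
      }
    }

Hive can reach a similar plan with its map joins, but when you know the data you can decide this yourself and combine it with other per-record work in the same pass.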
Bare MapReduce is not written very often these days. Higher-level abstractions such as the two you mentioned are more popular and adequate for query workloads.
Even in scenarios where HiveQL is too restrictive, one might look at alternatives such as Cascading or Scalding for low-level batch jobs, or the ever more popular Spark.
A primary motivation for using these high-level abstractions is that most applications require a sequence of map and reduce phases, and the MapReduce APIs leave you on your own to figure out how to chain the jobs and serialize data between them, as the sketch below illustrates.
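Here is a rough sketch of that plumbing (the paths and the count threshold are invented): two chained MapReduce jobs, where the intermediate path, the SequenceFile format used to pass typed records from job one to job two, and the blocking hand-off are all written by hand.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class WordCountThenFilter {

      // Stage 1: classic word count.
      public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          StringTokenizer tok = new StringTokenizer(value.toString());
          while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          ctx.write(key, new IntWritable(sum));
        }
      }

      // Stage 2: map-only pass over the counts produced by stage 1.
      public static class FrequentWordMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void map(Text word, IntWritable count, Context ctx)
            throws IOException, InterruptedException {
          if (count.get() >= 100) {   // arbitrary threshold, for illustration only
            ctx.write(word, count);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // the caller must clean this up later
        Path output = new Path(args[2]);

        Job count = Job.getInstance(conf, "word count");
        count.setJarByClass(WordCountThenFilter.class);
        count.setMapperClass(TokenizerMapper.class);
        count.setReducerClass(IntSumReducer.class);
        count.setOutputKeyClass(Text.class);
        count.setOutputValueClass(IntWritable.class);
        // Explicit choice of the on-disk format used to pass typed records to job 2.
        count.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(count, input);
        FileOutputFormat.setOutputPath(count, intermediate);
        if (!count.waitForCompletion(true)) {
          System.exit(1);                        // job 2 must not start before job 1 finishes
        }

        Job filter = Job.getInstance(conf, "frequent words");
        filter.setJarByClass(WordCountThenFilter.class);
        filter.setInputFormatClass(SequenceFileInputFormat.class);
        filter.setMapperClass(FrequentWordMapper.class);
        filter.setNumReduceTasks(0);             // map-only second stage
        filter.setOutputKeyClass(Text.class);
        filter.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(filter, intermediate);
        FileOutputFormat.setOutputPath(filter, output);
        System.exit(filter.waitForCompletion(true) ? 0 : 1);
      }
    }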

How exactly is Impala faster than Hive?

There are multiple tools built to access data in Hadoop.
Very popular amongst them are Hive and Impala. While Impala was built to address the batch nature of Hive (for low-cost SQL queries), Impala cannot eliminate MapReduce completely, as MapReduce is a really great framework for dealing with batch data.
For low-cost SQL queries, Impala gives dramatically better performance because it skips MapReduce jobs.
What exactly causes Impala to be faster than Hive? Is it in-memory execution? Or is it efficient and intelligent usage of the existing hardware (name nodes and data nodes)?

How is Apache Spark different from the Hadoop approach?

Everyone says that Spark uses memory and that because of this it's much faster than Hadoop.
I didn't understand from the Spark documentation what the real difference is.
Where does Spark store data in memory that Hadoop doesn't?
What happens if the data is too big for the memory? How similar would it be to Hadoop in that case?
Spark tries to keep things in memory, whereas MapReduce keeps shuffling things in and out of disk: Spark keeps intermediate output in main memory, whereas Hadoop stores intermediate results in secondary storage. MapReduce inserts barriers, and it takes a long time to write things to disk and read them back, so MapReduce can be slow and laborious. The elimination of this restriction makes Spark orders of magnitude faster. For things like SQL engines such as Hive, a chain of MapReduce operations is usually needed, and this requires a lot of I/O activity: onto disk, off of disk, onto disk, off of disk. When similar operations are run on Spark, it can keep things in memory without that I/O, so you can keep operating on the same data quickly. This results in dramatic improvements in performance, and it means Spark definitely moves us into at least the interactive category. For the record, there are some benefits to MapReduce writing everything to disk: it allows restarting after a failure. If you're running a multi-hour job, you don't want to begin again from scratch. For applications on Spark that run in seconds or minutes, a restart is obviously less of an issue.
It's also easier to develop for Spark. Spark is much more powerful and expressive in terms of how you give it instructions to crunch data. Spark has map and reduce functions like MapReduce, but it adds others like filter, join, and group-by, so it's easier to develop for.
Spark also adds libraries for things like machine learning, streaming, graph processing, and SQL.
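A small, hypothetical sketch in the Spark Java API (the log path and the strings being matched are invented) of that "keep operating on the same data" point: the filtered RDD is cached once and then reused by two actions without re-reading the input from disk.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CacheExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> errors = sc.textFile("hdfs:///logs/app.log")
            .filter(line -> line.contains("ERROR"))
            .cache();                        // keep the filtered lines in memory

        long total = errors.count();         // first action: triggers the read + filter
        long timeouts = errors                // second action: served from the cache
            .filter(line -> line.contains("timeout"))
            .count();

        System.out.println(total + " errors, " + timeouts + " timeouts");
        sc.stop();
      }
    }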
In Hadoop MapReduce the input data is on disk, you perform a map and a reduce and put the result back on disk. Apache Spark allows more complex pipelines. Maybe you need to map twice but don't need to reduce. Maybe you need to reduce then map then reduce again. The Spark API makes it very intuitive to set up very complex pipelines with dozens of steps.
You could implement the same complex pipeline with MapReduce too. But then between each stage you write to disk and read it back. Spark avoids this overhead when possible. Keeping data in-memory is one way. But very often even that is not necessary. One stage can just pass the computed data to the next stage without ever storing the whole data anywhere.
This is not an option with MapReduce, because one MapReduce does not know about the next. It has to complete fully before the next one can start. That is why Spark can be more efficient for complex computation.
The API, especially in Scala, is very clean too. A classical MapReduce is often a single line. It's very empowering to use.
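To make the pipeline point concrete, here is a hedged sketch in the Spark Java API (the file path and CSV layout are invented; the Scala version would be shorter still). It maps, filters, reduces, then maps and reduces again, with no hand-managed intermediate files between the stages.

    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PipelineExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("pipeline example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // sales.csv: country,product,amount
        JavaPairRDD<String, Double> revenueByCountry = sc.textFile("hdfs:///sales.csv")
            .map(line -> line.split(","))
            .filter(fields -> fields.length == 3)
            .mapToPair(fields -> new Tuple2<String, Double>(fields[0], Double.parseDouble(fields[2])))
            .reduceByKey(Double::sum);                 // first aggregation

        // Second aggregation over the result of the first, still in memory:
        // number of countries per revenue bucket (rounded to the nearest 1000).
        JavaPairRDD<Long, Integer> countriesPerBucket = revenueByCountry
            .mapToPair(t -> new Tuple2<Long, Integer>(Math.round(t._2() / 1000) * 1000, 1))
            .reduceByKey(Integer::sum);                // reduce, map, reduce again

        countriesPerBucket.saveAsTextFile("hdfs:///out/revenue_buckets");
        sc.stop();
      }
    }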

Time taken by MapReduce jobs

I am new to Hadoop and MapReduce. I have a problem running my data through Hadoop MapReduce: I want the results to be returned in milliseconds. Is there any way I can execute my MapReduce jobs in milliseconds?
If not, what is the minimum time Hadoop MapReduce can take on a fully distributed multi-node cluster (5-6 nodes)?
The file size to be analyzed in Hadoop MapReduce is around 50-100 MB.
The program is written in Pig. Any suggestions?
For ad-hoc, real-time querying of data, use Impala or Apache Drill (still a work in progress). Drill is based on Google Dremel.
Hive jobs get converted into MapReduce, so Hive is also batch-oriented in nature and not real-time. A lot of work is going on to improve the performance of Hive, though.
It's not possible (AFAIK). Hadoop is not meant for real-time work in the first place; it is best suited for batch jobs. The MapReduce framework needs some time to accept and set up the job, which you can't avoid. And I don't think it's a wise decision to get ultra-high-end machines to set up a Hadoop cluster. Also, the framework has to do a few things before actually starting the job, creating the logical splits of your data, for instance.

Query regarding HBase

As we have learnt, Hadoop is meant for batch processing of data. If we want to do some trending based on the results produced by Hadoop MapReduce jobs, what is the best way? How can we retrieve MapReduce results for trending?
Can HBase be used here? If so, does HBase have all the capabilities for filtering and aggregate functions on the data stored in it?
Thanks
MRK
While there is no perfect solution in the Hadoop world for this problem, there are a few approaches to solving this kind of problem:
a) Produce an "on-demand data mart" using MR, load it into an RDBMS, and run your queries there in real time (a rough sketch follows below). This can work if the data subset is much smaller than the whole dataset.
b) Use an MPP database integrated with Hadoop. For example, Greenplum HD has an MPP database pre-integrated with Hadoop.
c) Use a more lightweight framework such as Spark. It will have much lower latency, but expect your datasets to be comparable in size to the available RAM.
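As a hedged illustration of option (a) only (the JDBC URL, table, and record layout are invented, and in practice a tool such as Sqoop would do this): read the already aggregated MapReduce output and batch-insert it into an ordinary RDBMS table that real-time trending queries can hit.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class LoadTrendMart {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://dbhost/analytics", "etl", "secret");
             PreparedStatement insert = conn.prepareStatement(
                 "INSERT INTO daily_trends (term, day, hits) VALUES (?, ?, ?)");
             BufferedReader reader = new BufferedReader(
                 new FileReader("part-r-00000"))) {   // MR output copied down from HDFS
          String line;
          while ((line = reader.readLine()) != null) {
            String[] f = line.split("\t");            // term \t day \t hits
            insert.setString(1, f[0]);
            insert.setString(2, f[1]);
            insert.setLong(3, Long.parseLong(f[2]));
            insert.addBatch();
          }
          insert.executeBatch();
        }
      }
    }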
You probably want to look at Hive.
