MapReduce real-life uses - Hadoop

In which cases is MapReduce chosen over Hive or Pig?
I know that it is used when:
We need in-depth filtering of the input data.
We are working with unstructured data.
We are working with graphs.
But is there any place where we can't use Hive or Pig, or where MapReduce works much better, and is it used heavily in real projects?

Hive and Pig are generic solutions and carry some overhead while processing the data. In most scenarios it is negligible, but in some cases it can be considerable.
If many tables need to be joined, Hive and Pig apply a generic strategy; if you write MapReduce after understanding the data, you can come up with a more optimal solution.
However, MapReduce should be treated as a kernel. If your solution can be reused elsewhere, it is better to develop it using MapReduce and integrate it with Hive/Pig/Sqoop.
Pig can be used to process unstructured data, and it gives more flexibility than Hive while processing the data.
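If you take that reusable-kernel route, one common pattern is wrapping the shared Java logic as a Hive UDF, so the same code serves both MapReduce and HiveQL users. A minimal sketch using the classic org.apache.hadoop.hive.ql.exec.UDF API; the class StripDomain and its logic are invented for illustration, not from the answer above:

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Hypothetical example: keep only the local part of an email address.
    // The same class can be unit-tested and called from plain Java, then
    // registered in Hive with:
    //   ADD JAR my-udfs.jar;
    //   CREATE TEMPORARY FUNCTION strip_domain AS 'StripDomain';
    public class StripDomain extends UDF {
        public Text evaluate(Text email) {
            if (email == null) {
                return null;
            }
            String s = email.toString();
            int at = s.indexOf('@');
            return at < 0 ? email : new Text(s.substring(0, at));
        }
    }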

Bare MapReduce is not written very often these days. Higher-level abstractions such as the two you mentioned are more popular and adequate for query workloads.
Even in scenarios where HiveQL is too restrictive, one might seek alternatives such as Cascading or Scalding for low-level batch jobs, or the ever more popular Spark.
A primary motivation for using these high-level abstractions is that most applications require a sequence of map and reduce phases, and the MapReduce APIs leave you on your own to chain the jobs and figure out how to serialize data between them.
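To illustrate that chaining burden, here is a minimal sketch of running two jobs back to back by hand, with a SequenceFile carrying the intermediate data. The mapper/reducer class names and paths are placeholders, not from the original answer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class TwoPhaseDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            Path intermediate = new Path(args[1]); // scratch dir between jobs
            Path output = new Path(args[2]);

            // Phase 1: you must pick a serialization for the hand-off
            // yourself; SequenceFile keeps the key/value types intact.
            Job first = Job.getInstance(conf, "phase-1");
            first.setJarByClass(TwoPhaseDriver.class);
            first.setMapperClass(FirstMapper.class);    // placeholder class
            first.setReducerClass(FirstReducer.class);  // placeholder class
            first.setOutputKeyClass(Text.class);
            first.setOutputValueClass(IntWritable.class);
            first.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileInputFormat.addInputPath(first, input);
            FileOutputFormat.setOutputPath(first, intermediate);
            if (!first.waitForCompletion(true)) System.exit(1);

            // Phase 2: reads the intermediate SequenceFile produced above.
            Job second = Job.getInstance(conf, "phase-2");
            second.setJarByClass(TwoPhaseDriver.class);
            second.setMapperClass(SecondMapper.class);   // placeholder class
            second.setReducerClass(SecondReducer.class); // placeholder class
            second.setOutputKeyClass(Text.class);
            second.setOutputValueClass(IntWritable.class);
            second.setInputFormatClass(SequenceFileInputFormat.class);
            FileInputFormat.addInputPath(second, intermediate);
            FileOutputFormat.setOutputPath(second, output);
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }

Frameworks like Cascading, Scalding, and Spark exist largely to make this driver plumbing disappear.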

Related

What are the disadvantages of MapReduce?

What are the disadvantages of MapReduce? There are lots of advantages of MapReduce, but I would like to know its disadvantages too.
I would rather ask when MapReduce is not a suitable choice. I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where MapReduce is not a suitable choice:
Real-time processing.
It's not always very easy to implement everything as an MR program.
When your intermediate processes need to talk to each other (jobs run in isolation).
When your processing requires a lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch-process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system than a distributed one.
When you have OLTP needs. MR is not suitable for a large number of short online transactions.
There might be several other cases. But the important thing here is how well you are using it. For example, you can't expect an MR job to give you a result in a couple of milliseconds, and you can't count that as a disadvantage either; you are just using it in the wrong place. And that holds true for any technology, IMHO. Long story short, think well before you act.
If you still want to, you can take the above points as the disadvantages of MapReduce :)
HTH
Here are some use cases where MapReduce does not work very well:
When you need a response fast, e.g. in less than a few seconds (use stream processing, CEP etc. instead).
Processing graphs.
Complex algorithms, e.g. some machine learning algorithms like SVM; see also the 13 dwarfs (The Landscape of Parallel Computing Research: A View From Berkeley).
Iterations - when you need to process data again and again, e.g. KMeans - use Spark.
When the map phase generates too many keys; then the sorting takes forever.
Joining two large data sets with complex conditions (the equality case can be handled via hashing etc.).
Stateful operations - e.g. evaluating a state machine.
Cascading tasks one after the other - using Hive or Pig might help, but there is a lot of overhead in rereading and parsing the data.
You need to rethink and rewrite trivial operations like joins and filters in map/reduce/key/value patterns (a sketch of a filter recast as a mapper follows this list).
MapReduce assumes that the job can be parallelized, but that may not be the case for all data processing jobs.
It is closely tied to Java; of course you have Pig and Hive for rescue, but you lose flexibility.
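As an illustration of that rethinking point, even a trivial filter has to be recast as a full Mapper class. A minimal sketch, assuming a made-up tab-separated record layout with a status field in the second column; the class and predicate are illustrative, not from the original post:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // The "WHERE status = 'ERROR'" of a one-line query becomes a class:
    // a map-only job (zero reducers) that re-emits only matching lines.
    public class ErrorFilterMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length > 1 && "ERROR".equals(fields[1])) {
                context.write(line, NullWritable.get());
            }
        }
    }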
First of all, it streams the map output; if it were possible to keep it in memory, this would be more efficient. I originally deployed my algorithm using MPI, but when I scaled up, some nodes started swapping; that's why I made the transition.
The Namenode keeps track of the metadata of all files in your distributed file system. I am reading a Hadoop book (Hadoop in Action) and it mentioned that Yahoo estimated the metadata at approximately 600 bytes per file. This implies that if you have too many files, your Namenode could experience problems: at 600 bytes per file, 100 million files already amount to roughly 60 GB of metadata held in Namenode memory.
If you do not want to use the streaming API, you have to write your program in Java. I, for example, did a translation from C++. This has some side effects; for example, Java has a large string overhead compared to C. Since my software is all about strings, this is some sort of drawback.
To be honest, I really had to think hard to find disadvantages. The problems MapReduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it performs at its best. There are plenty of other distribution frameworks out there with their own characteristics.

Pig vs Hive vs Native Map Reduce

I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of the scenarios that require Hive, Pig, or native MapReduce.
I went through a few articles which basically point out that Hive is for structured processing and Pig is for unstructured processing. When do we need native MapReduce? Can you point out a few scenarios that can't be solved using Pig or Hive but only with native MapReduce?
Complex branching logic with a lot of nested if..else structures is easier and quicker to implement in standard MapReduce; for processing structured data you could use Pangool, which also simplifies things like JOINs. Standard MapReduce also gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance. But it requires more time to code and to introduce changes.
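A minimal sketch of the kind of branching that is awkward as nested CASE WHEN chains in a query language but trivial in a Java mapper; the CSV layout, field names, and thresholds are invented for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Routes each record to a category key based on nested conditions.
    public class RiskTagMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");  // hypothetical CSV layout
            if (f.length < 4) {
                return;  // skip malformed records
            }
            double amount = Double.parseDouble(f[2]);
            String country = f[3];
            String tag;
            if (amount > 10000) {
                tag = "unknown".equals(country) ? "review" : "high";
            } else if (amount > 1000) {
                tag = "medium";
            } else {
                tag = "low";
            }
            context.write(new Text(tag), line);
        }
    }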
Apache Pig is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key); it is simpler to implement things like:
Get the top N elements for each group;
Calculate a total per group and then put that total against each row in the group;
Use Bloom filters for JOIN optimisations;
Multiquery support (this is when Pig tries to minimise the number of MapReduce jobs by doing more stuff in a single job).
Hive is better suited for ad-hoc queries, but its main advantage is that it has an engine that stores and partitions data. But its tables can be read from Pig or standard MapReduce.
One more thing: Hive and Pig are not well suited to work with hierarchical data.
Short answer - we need MapReduce when we need very deep-level and fine-grained control over the way we want to process our data. Sometimes it is not very convenient to express what we need exactly in terms of Pig and Hive queries.
Nothing that you can do with MapReduce should be totally impossible through Pig or Hive. With the level of flexibility provided by Pig and Hive you can usually manage to achieve your goal, but it might not be that smooth; you could, for instance, write a UDF to get there (a hypothetical sketch of one follows).
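For instance, a custom record transformation can be plugged into Pig as a Java UDF. A minimal sketch using Pig's EvalFunc API; the class name NormalizeUrl and its normalization rule are invented for illustration:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Usable from a Pig script after: REGISTER my-udfs.jar;
    //   cleaned = FOREACH urls GENERATE NormalizeUrl(url);
    public class NormalizeUrl extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            // Hypothetical rule: lower-case and drop trailing slashes.
            return input.get(0).toString().toLowerCase().replaceAll("/+$", "");
        }
    }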
There is no clear distinction as such among the usage of these tools. It totally depends on your particular use-case. Based on your data and the kind of processing you need to decide which tool fits into your requirements better.
Edit:
Some time ago I had a use case wherein I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird: some of the data was EBCDIC encoded, while the rest was in binary format. It was basically a flat binary file with no delimiters like \n or anything. I had a tough time finding a way to process these files using Pig or Hive, and as a result I had to settle for MR. Initially it took time, but gradually it became smoother, as MR is really swift once you have the basic template ready.
So, like I said earlier, it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig (just a foreach), but what if you need to act on every nth record? So, when you need "that" level of control over the way you process your data, MR is more suitable.
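A minimal sketch of that every-nth-record control in a plain mapper; the sampling interval is invented, and note the counter is per map task, not global across the job:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits every 10th record seen by this map task -- a systematic
    // sample that classic Pig/Hive has no direct one-liner for.
    public class EveryNthMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private static final int N = 10;  // hypothetical interval
        private long seen = 0;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (++seen % N == 0) {
                context.write(line, NullWritable.get());
            }
        }
    }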
Another situation might be when your data is hierarchical rather than row-based, or when your data is highly unstructured.
Meta-pattern problems involving job chaining and job merging are easier to solve using MR directly rather than using Pig/Hive.
And sometimes it is very convenient to accomplish a particular task using some other tool instead of Pig/Hive; IMHO, MR turns out to be better in such situations as well. For example, if you need to do some statistical analysis of your Big Data, R used with Hadoop streaming is probably the best option to go with.
HTH
MapReduce:
Strengths:
Works on both structured and unstructured data.
Good for writing complex business logic.
Weaknesses:
Long development time.
Hard to achieve join functionality.
Hive:
Strengths:
Less development time.
Suitable for ad-hoc analysis.
Easy for joins.
Weaknesses:
Not easy for complex business logic.
Deals only with structured data.
Pig:
Strengths:
Handles structured and unstructured data.
Joins are easily written.
Weaknesses:
A new language to learn.
Gets converted into MapReduce anyway.
Hive
Pros:
SQL-like; database people love that.
Good support for structured data.
Currently supports database schemas and view-like structures.
Supports concurrent multi-user, multi-session scenarios.
Bigger community support (Hive, HiveServer, HiveServer2, Impala, Sentry, and more).
Cons:
Performance degrades as the data grows bigger; memory overflow issues, and not much you can do about it.
Hierarchical data is a challenge.
Unstructured data requires a UDF-like component.
Combining multiple techniques can be a nightmare, e.g. dynamic partitions with a UDTF in the case of big data.
Pig:
Pros:
Great script-based data-flow language.
Cons:
Unstructured data requires a UDF-like component.
Not as big a community.
MapReduce:
Pros:
I don't agree with "hard to achieve join functionality": if you understand what kind of join you want to implement, you can implement it with a few lines of code.
Most of the time MR yields better performance.
MR support for hierarchical data is great, especially for implementing tree-like structures.
Better control over partitioning / indexing the data (see the partitioner sketch after this list).
Job chaining.
Cons:
You need to know the API very well to get good performance.
More to code, debug, and maintain.
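To illustrate the partitioning control mentioned above, here is a minimal custom Partitioner sketch; the prefix-based routing rule is invented for illustration, while the Partitioner API itself is standard Hadoop:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes keys to reducers by the first character of the key, so that
    // related records (say, one region code) land in the same partition.
    public class PrefixPartitioner extends Partitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            int prefix = s.isEmpty() ? 0 : s.charAt(0);
            return (prefix & Integer.MAX_VALUE) % numPartitions;
        }
    }
    // Wired into a job with: job.setPartitionerClass(PrefixPartitioner.class);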
Scenarios where Hadoop MapReduce is preferred to Hive or Pig:
When you need definite driver-program control.
Whenever the job requires implementing a custom Partitioner.
If there already exists a pre-defined library of Java Mappers or Reducers for the job.
If you require a good amount of testability when combining lots of large data sets.
If the application demands legacy code requirements that command physical structure.
If the job requires optimization at a particular stage of processing, making the best use of tricks like in-mapper combining (sketched after this list).
If the job has some tricky usage of the distributed cache (replicated join), cross products, groupings or joins.
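A minimal in-mapper combining sketch: the mapper aggregates counts in a hash map and emits them once in cleanup(), cutting shuffle volume compared to emitting one (word, 1) pair per token. The tokenization is simplified for illustration:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperCombiningMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            // Aggregate locally instead of writing one pair per token.
            for (String word : line.toString().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit each word exactly once per map task.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }

One caveat: the in-mapper map grows with the number of distinct keys per task, so this trades memory for shuffle traffic.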
Pros of Pig/Hive:
Hadoop MapReduce requires more development effort than Pig and Hive.
Pig and Hive coding approaches are slower than a fully tuned Hadoop MapReduce program.
When using Pig and Hive for executing jobs, Hadoop developers need not worry about any version mismatch.
There is very little chance for the developer to introduce Java-level bugs when coding in Pig or Hive.
Have a look at this post for a Pig vs Hive comparison.
All the things which we can do using Pig and Hive can be achieved using MR (though it will sometimes be time-consuming). Pig and Hive use MR/Spark/Tez underneath, so everything that MR can do may or may not be possible in Hive and Pig.
Here is a great comparison; it covers all the use-case scenarios.

Performance comparison: Hive & MapReduce

Hive provides an abstraction layer over Java MapReduce jobs, so it should have performance issues when compared to Java MapReduce jobs.
Do we have any benchmark comparing the performance of Hive queries and Java MapReduce jobs?
Actual use-case scenarios with runtime data would be a real help.
Thanks
Your premise that "it should have performance issues when compared to Java MapReduce jobs" is wrong.
Hive (and Pig, Crunch, and other map/reduce abstractions) would be slower than a fully tuned, hand-written map/reduce job.
However, unless you're experienced with Hadoop and map/reduce, chances are that the map/reduce code you'd write would be slower on non-trivial queries than what Hive et al. will produce.
I did some small tests in a VM some time back and I couldn't really notice any difference. Maybe Hive was a few seconds slower sometimes, but I can't really tell whether that was Hive's performance or my VM hanging due to low memory. One thing to keep in mind is that Hive will always try to determine the fastest way to run a MapReduce job. Now, when you write small MapReduce jobs, you'll probably be able to find the fastest way yourself. But with large, complex jobs (with joins, etc.), will you always be able to compete with Hive?
Also, the time you need to write a MapReduce job of multiple classes and methods seems to take ages in comparison with writing a HiveQL query.
On the other hand, I had the feeling that when I wrote the job myself, it was easier to know what was going on.
If you have a small dataset on your machine and want to process it using Apache Hive, execution of the job on that small dataset will be slow compared to processing the same dataset with Hadoop MapReduce; Hive's performance degrades slightly on small datasets. For large datasets, however, Apache Hive's performance would be better as compared to MapReduce.
While processing datasets in MapReduce, the dataset is stored in HDFS. MapReduce has no database of its own, whereas Hive has a metastore. From Hive's metastore, data can be shared with Impala, Beeline, and JDBC and ODBC drivers.

Using Pig/Hive for data processing instead of direct Java map-reduce code?

(Even more basic than Difference between Pig and Hive? Why have both?)
I have a data processing pipeline written in several Java map-reduce tasks over Hadoop (my own custom code, derived from Hadoop's Mapper and Reducer). It's a series of basic operations such as join, inverse, sort and group by. My code is involved and not very generic.
What are the pros and cons of continuing this admittedly development-intensive approach vs. migrating everything to Pig/Hive with several UDFs? Which jobs won't I be able to execute? Will I suffer a performance degradation (working with hundreds of TB)? Will I lose the ability to tweak and debug my code when maintaining it? Will I be able to pipeline part of the jobs as Java map-reduce and use their input/output with my Pig/Hive jobs?
Reference, Twitter: typically a Pig script is 5% of the code of native map/reduce, written in about 5% of the time. However, queries typically take between 110% and 150% of the time that a native map/reduce job would have taken to execute. Of course, if there is a routine that is highly performance-sensitive, they still have the option to hand-code the native map/reduce functions directly.
The above reference also talks about pros and cons of Pig over developing applications in MapReduce.
As with any higher-level language or abstraction, some flexibility and performance are lost with Pig/Hive in exchange for developer productivity.
In this paper from 2009 it is stated that Pig runs 1.5 times slower than plain MapReduce. It is expected that higher-level tools built on top of Hadoop perform slower than plain MapReduce; however, it is also true that making MapReduce perform optimally takes an advanced user who writes a lot of boilerplate code (e.g. binary comparators).
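To give a feel for that boilerplate, here is a minimal raw-comparator sketch that sorts serialized Text keys without deserializing them. Hadoop already ships an equivalent comparator for Text; this only shows the shape of what an advanced user writes for custom key types:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.io.WritableUtils;

    // Compares keys in their serialized form, skipping the vint length
    // prefix that Text writes, so sorting never deserializes an object.
    public class RawTextComparator extends WritableComparator {

        protected RawTextComparator() {
            super(Text.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            int n1 = WritableUtils.decodeVIntSize(b1[s1]);
            int n2 = WritableUtils.decodeVIntSize(b2[s2]);
            return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
        }
    }
    // Wired into a job with: job.setSortComparatorClass(RawTextComparator.class);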
I find it relevant to mention a new API called Pangool (of which I'm a developer) that aims to replace the plain Hadoop MapReduce API by making a lot of things easier to code and understand (secondary sort, reduce-side joins). Pangool doesn't impose a noticeable performance overhead (barely 5% as of its first benchmark) and retains all the flexibility of the original MapReduce API.

What is Google's Dremel? How is it different from MapReduce?

Google's Dremel is described here. What's the difference between Dremel and MapReduce?
Dremel and MapReduce are not directly comparable, but rather they are complementary technologies.
MapReduce is not specifically designed for analyzing data - rather it's a software framework that allows a collection of nodes to tackle distributed computational problems for large datasets.
Dremel is a data analysis tool designed to quickly run queries on massive, structured datasets (such as log or event files). It supports a SQL-like syntax, but apart from table appends it is read-only: it doesn't support update or delete operations, nor does it feature table indexes. Data is organized in a columnar format, which contributes to very fast query speed. Google's BigQuery product is an implementation of Dremel accessible via a RESTful API.
Hadoop (an open-source implementation of MapReduce), in conjunction with the Hive data warehouse software, also allows data analysis of massive datasets using a SQL-style syntax. Hive essentially turns queries into MapReduce jobs. In contrast to Dremel's columnar storage, Hive attempts to make queries fast by using techniques such as table indexing.
Check this article out. Dremel is what the future of Hive should (and will) be.
The major issue with MapReduce and the solutions on top of it, like Pig and Hive, is that there is an inherent latency between running a job and getting the answer. Dremel uses a totally novel approach (it came out in 2010 in that paper by Google) which...
...uses a novel query execution engine based on aggregator trees...
...to run near-real-time, interactive AND ad-hoc queries, neither of which MapReduce can handle. And Pig and Hive aren't real-time.
You should keep an eye on projects coming out of this. It is pretty new for me too... so any other expert comments are welcome!
Edit: Dremel is what the future of Hive (and not MapReduce, as I said before) should be. Hive right now provides a SQL-like interface to run MapReduce jobs. Hive has very high latency, and so is not practical for ad-hoc data analysis. Dremel provides a very fast SQL-like interface to the data by using a technique different from MapReduce.
MapReduce is an abstract algorithm for how to split a problem up, distribute it, and combine the results. Dremel appears to be a specific tool for querying and analyzing datasets.
